[00:02:59] 10Analytics-Tech-community-metrics, 07Regression: Only display organizations defined in Wikimedia's DB (disable assuming orgs via hostnames in email addresses) - https://phabricator.wikimedia.org/T161308#3170152 (10Albertinisg) After reading the comments I've removed several organizations based in the previous... [00:06:50] 10Analytics-Tech-community-metrics: "Email senders" widget empty though "Mailing Lists" widget states that there are email senders - https://phabricator.wikimedia.org/T159229#3061153 (10Albertinisg) We updated the profiles of every identity based on the data source. The issue we had here was an empty field, so t... [00:25:30] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3170220 (10Albertinisg) > I merged https://github.com/Bitergia/mediawiki-identities/commit/50ab30725ea9d6eb03487bb9eb1849965a4d8d1c on 2017-02-09 for T157... [06:20:43] joal: o/ [06:22:04] my first read of the emails was not good this morning, too many errors :) [06:22:17] but reading the backlog it seems that there were only suspended jobs? [06:22:27] not sure if due to me switching the masters [06:38:31] or maybe the new nodes added? [06:40:39] brb [07:22:10] trying to fix analytics1064 and analytics1068 [07:22:16] (new nodes!!) [07:22:32] I still need to reboot some hadoop workers (~6 IIRC) for the new kernels [07:22:39] so let me know if I can do it or not :) [09:35:56] I am installing 1064 and 1068, there were some dns config issues but they are solved now [10:02:35] going afk for early lunch + running errand (~1hour) [10:02:38] ttl! [11:01:38] Hi elukey - Please excuse me I got a late start today [11:02:29] joal: hello! I had an interesting issue to solve with dns for 1064 and 1068, no problem at all :) [11:02:39] is it good for you if I reboot a couple of workers? [11:02:52] please go ahead, nothing special running currently [11:03:33] elukey: We don't know what yesterday's issue was (new workers or master switch or anything else that might have caused the alarms), but there were some jobs suspended [11:04:27] elukey: And something to keep in mind when working with suspended jobs is to make sure one resumes coordinators AND workflows [11:04:51] elukey: Without nuria's check on workflows, I'd have woken up this morning with a lot of late jobs ! [11:04:59] yep yep :( [11:05:14] PROBLEM - Hadoop DataNode on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [11:05:21] hello 1068!
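The PROBLEM/RECOVERY lines above and below come from Icinga's standard process check, which simply counts Java processes by their command-line arguments; the DataNode alert boils down to something like the following check_procs invocation (the flags are a guess at the shape of the check, not copied from the real Icinga/puppet config):

    /usr/lib/nagios/plugins/check_procs -c 1: -C java \
        --argument-array=org.apache.hadoop.hdfs.server.datanode.DataNode   # critical if no matching DataNode process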
[11:06:14] RECOVERY - Hadoop DataNode on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [11:06:30] there you go [11:06:36] 1064 and 1068 up and running [11:06:56] Man this is awesome :) [11:07:33] mmm 1068 has some problems with partitions, fixing it [11:18:14] PROBLEM - Hadoop DataNode on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [11:20:24] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:27:24] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:28:14] RECOVERY - Hadoop DataNode on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [11:30:03] ok it should be good now :) [11:36:01] 10Analytics-Tech-community-metrics, 06Developer-Relations (Apr-Jun 2017): "Email senders" widget empty though "Mailing Lists" widget states that there are email senders - https://phabricator.wikimedia.org/T159229#3170923 (10Aklapper) 05Open>03Resolved Yay! Thanks a lot! :) ([[ https://github.com/grimoirel... [11:39:28] team: draining analytics10[44,45,46] before rebooting [11:39:43] then I'll do 47,48,49,50 [11:39:53] and 4.9 will be installed everywhere [11:54:23] now draining analytics[1047-1050].eqiad.wmnet [11:54:30] I'll reboot them in a bit [12:01:56] joal: on analytics1049 I have two spark containers with your username in some paths :P [12:02:43] elukey: NUKE'EM'ALL ! [12:03:30] okkk [12:12:44] all workers running linux 4.9 [12:20:01] https://racktables.wikimedia.org/index.php?page=rack&rack_id=2104 [12:20:08] we have two kafka nodes on the same rack -.0 [12:24:52] https://racktables.wikimedia.org/index.php?page=object&object_id=1566 [12:24:56] same thing in here.. [12:25:06] 4 kafka nodes in two racks [12:25:40] 10Analytics, 10Analytics-Cluster, 06Operations, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3171078 (10faidon) @Ottomata @RobH this seems to have been stalled somewhere between you two. Could you guys figure this and and T159839 out this week? Thanks! [12:41:13] elukey: if you have a minute: https://wikitech.wikimedia.org/wiki/Analytics [12:41:53] joal: yes I like the Analytics team, I confirm [12:42:00] * elukey runs aaway [12:43:19] nice page! [12:49:16] elukey: there is a lot to add for improvement, but I prefer it now :) [12:59:40] ottomata: hhhhhhhhhhhhhhiiiiiiiiiiiiiiIIIIIIIIIIIIIIIiiiiiiiiiiiiiiiiiiiiiiiii [13:00:32] hiiiii [13:00:46] sorry about those a68 alarms, i hadn't run puppet at all there yet so i wasn't expecting those [13:00:49] (still reading emails....) [13:00:58] no no it is my fault [13:01:01] ottomata: Hulllllo ! [13:01:13] I did something wrong with partitions and hadoop was complaining [13:01:26] ottomata: It's the python-package-bothering-guy speaking :D [13:01:27] 64 and 68 are up and running :) [13:02:05] ottomata: one thing to discuss - 4 analytics kafka nodes are on two racks only (2 in one rack, 2 in another one) [13:02:34] elukey: that sounds good, right? [13:02:51] 2, 2 and 2? 
[13:03:32] ottomata: I'd prefer six different racks to be honest :D [13:03:55] elukey: afaik racks don't matter too much, at least when it comes to networking [13:03:58] does matter for power [13:04:08] yep this is my concern.. [13:04:10] but, they are spread against 3 rows at least, right? [13:04:21] should be yes [13:04:33] i guess when we order new brokers we can ask for 6 different racks [13:04:41] joal: , whatts up? [13:04:42] but Arzhel pinged me today since they need to do some work on Row D and stop some racks [13:04:58] oh [13:04:59] hey ottomata, have you installed the packages on statX? [13:05:04] precisely https://racktables.wikimedia.org/index.php?page=rack&rack_id=2104 [13:05:05] joal: naw haven't done that, sorry [13:05:11] sp we [13:05:13] sorry [13:05:14] will do shortly [13:05:18] np ottomata, was just double checking :) [13:05:23] so we'll lose kafka1020 and 1018 [13:05:26] thanks for the poke [13:06:01] it might be a good occasion to move say kafka1020 to another rack before the maintenance [13:06:04] elukey: what was wrong with 1064, and how did you fix partitioning on 1068? [13:06:13] oh, dns? [13:06:22] niiiice [13:06:25] nm, i see your commit [13:06:31] ottomata: it took me a while and Riccardo helped [13:06:32] https://gerrit.wikimedia.org/r/#/c/347577/ [13:07:17] puppet cert --list was returning weird requests for wnmet etc.. [13:16:39] joal: are the python3 packages we are installing all of the python dependencies for ores in general? [13:16:52] or are we missing some, and this is just for a specific special use case of ores code? [13:17:06] ottomata: nope, ORES needs more, and I load other stuff manually in spark [13:17:14] ok [13:17:37] ottomata: However having the list preinstalled allows for virtualenv to be easier to setup (less stuff to inxstall) [13:18:05] k [13:18:40] hey guys sorry if I haven't answered your email about NLTK but I wanted to ask a bit more details since I have no context :( [13:22:07] elukey: so the stuff joseph needs are just a few text files on all the worker nodes [13:22:20] the python nltk code instructions just say to download it [13:22:26] when you want to use it [13:22:30] which, we can't do [13:22:42] so we need a way to get that data (and maybe other data? not sure) onto all the workers [13:22:49] so its the usual question: [13:22:56] deb package or puppet or scap3? [13:23:38] i was thinking that if we are going to have to mainatin more than just nltk specific data, and have ores specific stuff [13:23:50] maybe we should have a git repo for this, where we can more easily put whatever we want in it [13:23:56] and it can be deployed with scap3 [13:27:00] +1 [13:28:12] joal: we can't do these packages on stat100[24] [13:28:13] they are trusty :) [13:28:16] not jessie [13:28:30] Arf ... [13:28:33] elukey: we need to add reimage of stat1004 to our list of nodes [13:28:39] it should be easy, since it is only an 'analytics client' [13:28:41] not a statistics client [13:28:52] ottomata: I'm gonna try from an1042 then, with your permission [13:28:54] stat100[23] we should jsut wait for new nodes [13:28:57] joal: +1 [13:29:13] ottomata: sure, we can do stat1004 this week if you want [13:29:15] i think we can reimage stat1004 soon enough [13:29:15] yeah [13:29:24] i think not many people use it... [13:30:38] elukey: let's do it now! ? [13:30:39] :) [13:30:42] i can drive it [13:32:49] ahhaha [13:32:49] sure [13:32:55] the only reason we didn't install it jessie when we got it was that we didn't have cdh jessie packages then [13:32:56] k! 
[13:33:17] ottomata: let's check 'last' just to be sure :) [13:33:23] we also need to backup homes [13:33:29] k [13:36:47] ottomata: there some dictionary packages I forgot to add for languages (my bad) [13:38:04] maybe check stat1004 for any custom builds of package not from CDH or Ubuntu, these might need to be rebuilt for jessie? [13:38:22] custom builds of packages moritzm? [13:38:40] ottomata: https://gist.github.com/jobar/51669c4dc6bb63cc1860f55015cf70b3 [13:38:57] ottomata: all those exist for jessie, currently installing them on h-1 [13:39:11] ottomata: sorry for not noticing earlier [13:39:18] moritzm: i don't think so, because we don't really include anything on stat1004 that isn't on hadoop workers, which we are already installing jessie with [13:39:22] ottomata: what I mean; if there's any packages running on stat1004 which you built for trusty-wikimedia in the past and which might need a rebuild for jessie-wikimedia [13:39:28] ok [13:39:30] oh ok joal [13:39:36] ah ok [13:39:40] yeah, there shouldn't be [13:40:08] just checked, we should be fine [13:52:36] ok elukey stat1004 /home backed up as tar in /srv, and also copied to stat1002 [13:53:14] super [13:53:26] there's nothing in /srv that needs saving, afaict, its all puppetized [13:53:32] so i think we can let partman do its thing [13:53:39] then we can restore /home [13:53:40] that should be it [13:53:49] +1 [13:53:52] i've silenced icinga alarms [13:53:54] proceeding [13:54:09] wmf-auto-reimage does silence icinga too [13:54:11] :) [13:54:26] (and depools if necessary) [13:54:35] 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3171426 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['stat1004.eqiad.wmnet'] ``` The... [13:55:00] elukey: it does?! oh! [13:55:02] amazing, didn't realize that [13:56:42] yes it is! [14:01:43] (03PS1) 10Joal: [WIP] Add oozie jobs loading uniques in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347611 (https://phabricator.wikimedia.org/T159471) [14:02:22] ottomata, elukey - Is stat1004 ready or not yet? [14:03:14] nono I think that Andrew just started [14:03:44] joal: i'm reimaging it now [14:03:50] k [14:06:30] (03PS2) 10Joal: [WIP] Add oozie jobs loading uniques in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347611 (https://phabricator.wikimedia.org/T159471) [14:09:29] o/ ottomata & joal [14:09:35] hiiii [14:09:38] FYI, we maintain a repo of python wheels for ORES. [14:09:49] https://phabricator.wikimedia.org/diffusion/1915/ [14:09:59] It works pretty nicely for keeping an environment versioned. [14:10:02] ottomata: I just merged the stat1004 change on puppetmaster1001 [14:10:13] Not sure if that is relevant to the conversation I saw above or not. [14:10:15] did you already start the reimage? [14:11:22] halfak: I'm having problem with nltk_data :( [14:11:27] elukey: oh no! [14:11:35] i did. [14:11:39] I just ran puppet on install1002 :( [14:11:48] poop scoops, i ran puppet on install1002 but didn't watch output [14:11:51] so i forgot to emrge [14:11:55] ok will reimage again... [14:11:59] it got stuck on partitioning anyway [14:12:03] had to confirm no swap needed [14:12:23] ah we can fix the partman config if we want [14:12:27] should be oneliner [14:13:16] joal, gotcha. They do things weird. [14:13:25] I also have nltk data checked into that repo with the wheels. 
[14:14:09] joal: can we just deploy the wheels repo with scap3? [14:14:15] halfak: which makes us do weird things :) [14:14:44] instead of worrying about these packages? [14:14:49] halfak: this wheels repo is for jessie, ja? [14:14:57] ottomata: possibly yes, that'd be great (it would even save me from loading them using py-files) [14:17:45] ottomata, yup [14:17:59] but it's pretty easy to rebuild for another install. [14:18:23] ottomata: looks like we've gone through the packages fight for nothing :( [14:18:26] But yeah, standard jessie build in prod and labs [14:18:32] haha, maybe so [14:18:34] ottomata: I apologize deeply [14:18:37] we shoulda thought of that earlier [14:18:42] figured out how they ran in prod [14:19:24] yeah [14:19:34] * joal facepalms [14:19:57] Sorry to be helpful late :/ [14:23:14] halfak: so for puppet, what do we need? ores::base does some packages [14:23:41] ores::base should allow you to install ORES from wheels and our deploy repos. [14:24:07] If you're using ORES in analytics mode, then ores::base will allow you to set up your virtualenv and take it from there. [14:24:28] If you want celery and uwsgi running, you'll need other roles. [14:24:39] how do you deploy the wheels? [14:24:55] halfak: we need to run ores in non-virtual-env mode - will it be ok? [14:25:18] joal: why does it have to be non venv mode? [14:25:30] They are a submodule of the deploy repo. We do a 'git submodule update --init' after pulling fresh code in the deploy repo and then do a 'pip install *.whl --no-deps' [14:25:33] ottomata: because pyspark doesn't do virtualenv [14:25:56] joal, theoretically. But that means everyone's going to need to use our versions of things. [14:25:57] ottomata: pyspark launches python per worker [14:26:02] halfak: do you use scap3? [14:26:06] ottomata, yeah [14:26:26] i'm missing that, i don't see any scap::target for ores in puppet [14:26:34] https://phabricator.wikimedia.org/diffusion/1880/browse/ [14:26:42] https://phabricator.wikimedia.org/diffusion/1880/browse/master/scap/ [14:27:15] ottomata, sorry, not that familiar with scap::target usage. [14:27:39] don't your targets need to be configured to accept a scap3 deployment of ores? [14:28:33] maybe? It seems that we push code to the targets using scap from deployment.eqiad.wmnet, but maybe not? [14:28:44] hm [14:28:52] yeah, maybe it lets you because you are using scb [14:28:56] I'm honestly not that clear on how scap goes about delivering code once I've asked it to sync. [14:29:01] and scb already has some scap target config [14:29:05] could be [14:30:15] ok, if this works, can we add a scap environment for analytics to your deploy repo? [14:31:04] ottomata, hmm... not sure what that entails, but it doesn't sound too crazy. [14:31:24] ottomata: would scap take care of installing the wheels correctly? [14:31:34] k, ya it should be fine and not affect your normal deploys [14:31:43] that way we don't have to set up our own deploy repo [14:32:00] joal: i'm not sure how wheels work at all [14:32:04] i think [14:32:16] that halfak builds locally...(in a docker maybe?) [14:32:17] halfak: does your code have the non-C-bound parser from hell ? If not I'll need to add it [14:32:20] that then commits the deps to this repo [14:32:39] ottomata, I build on a labs instance [14:32:47] With ores::base [14:32:53] ok [14:32:59] joal, not aware of that option at all.
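Piecing together what halfak describes, the deploy-side install he refers to is roughly the following; the checkout path is assumed, and the submodules/wheels location only comes up a bit further down, so this is a sketch rather than the actual deploy script:

    cd /srv/deployment/ores/deploy          # assumed location of the scap-deployed repo
    git submodule update --init             # pulls in the wheels submodule
    pip install submodules/wheels/*.whl --no-deps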
[14:33:04] but ya, all deps are committed as wheels to this repo [14:33:09] so, as long as we include ores::base on analytics nodes [14:33:12] and then deploy this repo there [14:33:13] it should work [14:33:15] ja? [14:33:34] the venv shouldn't be necessary... i think... as long as the python path is set up, joal, when you want to run [14:33:36] but i'm not sure about that [14:33:43] halfak: you told me about it :) Building MWParserFromHell not using C bindings [14:34:05] Must have researched it and promptly forgot :S [14:34:13] i haven't really worked with wheels before [14:34:13] np halfak [14:34:17] halfak: if we do something like [14:34:34] export PYTHONPATH=/srv/deployment/ores/deploy/submodules/wheels [14:34:37] ottomata, wheels are pretty basic. They are practically just zip files and pip has a protocol for moving the stuff to the right spot. [14:34:39] it should be able to load up deps? [14:34:48] ottomata, you can tell pip where to do that when you install. [14:34:59] pip? [14:35:07] we have to pip install on the target? [14:35:40] ottomata: I think halfak thinks we're gonna pip install things - The idea is to not do that [14:35:54] joal, pip works in offline mode too [14:35:56] halfak: is not talking about pip installing deps [14:35:57] yeah [14:35:57] When you give it wheels [14:36:06] i think they use it for installing into a venv [14:36:11] right [14:36:18] but, we don't want to pip install at all, because we can't use venv, and we don't want to install globally [14:36:22] No network connection necessary [14:36:32] pip installs into whatever you want [14:36:35] hm [14:36:40] ottomata: we could pip install to some /usr/local ? [14:36:54] pip is just a nice way to handle packages and wheels. It happens to default to system folders and using the internet [14:37:17] joal: no [14:37:26] we could pip install into /srv somewhere though :) [14:37:47] as long as we don't run as super user, and it is automated by scap, i'm ok with it [14:37:50] don't love it though [14:37:56] i don't understand why the pip install is needed [14:38:01] if all the deps are deployed already [14:38:02] ottomata: I don't know how I could tell spark to use another folder for wheel loading [14:38:07] seems like pythonpath could be modified [14:38:36] ottomata: I think pip is doing some unpacking [14:38:41] joal: I think you should be able to load py modules from anywhere as long as sys.path is set up properly [14:38:44] oh [14:38:46] maybe so [14:38:53] yeah that i don't know, not having worked with wheels [14:39:05] is python itself not able to load up wheels as modules? [14:39:10] on a really unrelated question, is elukey around? [14:39:17] when will he be? [14:39:31] ottomata: I think wheels need to be installed [14:40:22] Amir1: o/ [14:40:22] ottomata, wheels need to be extracted to be used.
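If the wheels do have to be unpacked outside a virtualenv, pip can still do the extraction offline into an arbitrary prefix and the result can be exposed via PYTHONPATH; a rough sketch of what is being discussed here (the --target directory is invented for illustration, only the wheels path comes from the log):

    pip install --no-index --no-deps \
        --target=/srv/analytics/ores-packages \
        /srv/deployment/ores/deploy/submodules/wheels/*.whl   # offline, wheels only, no system dirs touched
    export PYTHONPATH=/srv/analytics/ores-packages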
[14:40:35] Extracting runs no arbitrary code FYI [14:40:41] hm [14:40:43] hey elukey https://gerrit.wikimedia.org/r/#/c/347395/ [14:40:49] ok i guess that's fine then [14:40:54] as long as we can specify where it should go [14:41:02] ya'll should keep in mind we run this in prod :P [14:41:07] I want to deploy this but I was told that I need to check with Ops specially on redis instances [14:41:16] we'd never get away with running arbitrary code during deployment [14:41:21] yaya, it snot arbitrary code [14:41:24] its the venv prb [14:41:26] prob [14:41:30] i guess spark can't do it [14:41:31] because we are changing our lock managers from Sql DB to redis [14:41:34] so we need to figure out how to it manually [14:41:49] Yeah... so that. If spark knows classpath, then it should'nt be a problem [14:41:51] and I was told that you worked on these instances recently [14:41:55] we used to do global (no arbirary code deps) pip install for eventlogging [14:41:56] and it sucked [14:41:58] so [14:41:59] instead [14:42:02] Also, the local folder can be added to the search path easily. [14:42:03] elukey: Is there anything we need to keep in mind? [14:42:03] deps are all deb packages [14:42:12] and the services run directly out of /srv/deployment/eventlogging/... [14:42:21] where is PWD for spark? [14:42:22] by setting PYTHONPATH in systemd init script [14:42:27] Can't I just install there? [14:42:58] not sure about that, since spark is a job launched by hadoop [14:43:03] halfak: given that workers change all the time, PWD changes all the time and is temporary [14:43:21] Amir1: I am not super expert in Redis to give you a precise judgement.. are we already using it elsewhere? [14:43:27] joal, maybe there's a PWD setup operation we could tie into? [14:43:39] Amir1: (let's move to the ops chan) [14:43:39] elukey: for Wikidata no [14:43:44] sure [14:45:29] halfak: for wheels only I can pass them to my job manually - problem is with nltk_data [14:45:31] oh elukey i think the boot didnt' wait long enough for raid again [14:45:33] how did we fix that before? [14:46:06] rootdelay=30 in edit mode when "Debian" pops up during boot :) [14:46:20] nltk_data can't be passed manually? In the case of that data, it can be placed anywhere and just used. [14:46:47] http://stackoverflow.com/questions/3522372/how-to-config-nltk-data-directory-from-code [14:47:53] joal: i'm reading that spark works with venv [14:47:55] http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/ [14:47:56] halfak: still means we need that data in a known and accessible place in every worker [14:48:04] Once you have a virtualenv setup in a uniform location on each node in your cluster, you can use it as the Python executable for your Spark executors by setting the PYSPARK_PYTHON environment variable to /path/to/mynewenv/bin/python. [14:48:27] ottomata: ooooh ! That's rgeat news ! [14:48:35] perfect :) [14:49:16] ottomata: this means preinstall everythiong using scap on every worker + stat1004 -correct? [14:49:33] ja, we'd make a special scap env for analytics that targets everything [14:49:34] deploy [14:49:46] and then you should be able to just set that env var in pyspark to the venv path that gets deployed [14:50:02] ottomata: also, I think virtualenv is not installed on stat1004 ;) [14:50:08] halfak: , scap deploy takes care of the venv creation and pip install on the target, right? [14:50:16] joal: it will be by ores::base puppet class :) [14:50:30] ottomata: ok great [14:50:41] ottomata, right. 
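In short, once the same virtualenv (and the nltk data) sits at a fixed path on every node, pointing pyspark at it is just a couple of environment variables; both paths below are made up for the example:

    export PYSPARK_PYTHON=/srv/deployment/ores/venv/bin/python   # venv deployed to the same path everywhere
    export NLTK_DATA=/srv/deployment/ores/nltk_data              # nltk reads this instead of trying to download
    pyspark --master yarn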
[14:52:27] elukey: not sure if i'm in the right stop [14:52:28] spot [14:52:49] batcave real quick? [14:53:01] ottomata: I can log in the console and check if you want! [14:53:03] will take me 1 min [14:53:05] ok [14:53:23] i'm out [14:53:30] its on the boot select screen [14:53:32] i hit 'e' to edit [14:53:36] but it didn't look familiar [14:55:34] done! [14:55:36] but I can see [14:55:37] [ 35.554790] ata1.00: cmd 60/80:00:00:e9:a0/00:00:00:00:00/40 tag 0 ncq dma 65536 in [14:55:40] [ 35.554790] res 40/00:50:00:ee:a0/00:00:00:00:00/40 Emask 0x10 (ATA bus error) [14:55:43] that are not super good [14:56:11] 10Analytics-Dashiki, 06Analytics-Kanban: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3171597 (10Nuria) a:03mforns [14:56:32] halfak: joal, is this nltk data also already included in the ores scap deploy? [14:56:39] uh oh [14:56:45] ottomata: it is ! [14:57:08] ottomata: looks like everything will been taken care of if we have venv setup correctly [14:57:16] \o/ [14:59:17] joal: just checking, hvae you read http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/ ? [14:59:34] i think that won't totally work for us, because some of the prod ores deps are deb packages (scipy i think?) [14:59:36] ottomata: absolutely not - I'm not used to run pyspark at all [14:59:57] not sure if this is better or not [15:00:10] ottomata: from what I understood with halfak, the repo contains all deps - correct halfak l? [15:00:20] but it seems that article says it is possible to distribute a zipped venv to pyspark with --archives [15:00:33] ottomata, joal: standddupppp [15:00:37] which would mean we wouldn't have to deploy to all worker nodes [15:00:49] ottomata: willl read [15:01:14] joal, that's right. [15:01:37] ottomata, the worker nodes will need some things from ores::base [15:01:56] But if we can shuffle around the venv and nltk_data, that should work just fine. [15:02:34] For an example of using archives to access a venv, see https://github.com/halfak/measuring-edit-productivity/blob/master/hadoop/revdocs2diffs.hadoop [15:02:58] Note how the script actually sets up the venv, transfers it to HDFS, and then references it from the job. [15:03:25] I didn't do nltk data with that job, so I'm not sure how that's going to work. [15:07:31] 10Analytics, 10Analytics-Cluster, 06Operations, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3171675 (10RobH) >>! In T159838#3171078, @faidon wrote: > @Ottomata @RobH this seems to have been stalled somewhere between you two. Could you guys figure this and an... [15:21:47] Greetings. Any hints about how to set a notification when a user connects from a given country? [15:24:05] 06Analytics-Kanban: Install npl pyspark python packages on hadoop - https://phabricator.wikimedia.org/T162706#3171850 (10Nuria) [15:40:22] 06Analytics-Kanban: Spark + ORES in Hadoop - https://phabricator.wikimedia.org/T162706#3171933 (10Ottomata) p:05Triage>03Normal [15:41:01] 06Analytics-Kanban: Spark + ORES in Hadoop - https://phabricator.wikimedia.org/T162706#3171850 (10Ottomata) See also: http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/ http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/ [16:01:13] oh, elukey did you finish stat1004? [16:01:38] ottomata: nope, I was about to ask you [16:01:45] it seems puppet has run! [16:01:48] and it is now jessie [16:02:15] gooood! 
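For reference, the zipped-venv alternative from the articles linked above (shipping the environment with the job via --archives instead of pre-installing it on every worker) looks roughly like this; the paths, the config keys used to point at the Python executable, and the job name are illustrative, not a tested recipe:

    cd /srv/deployment/ores/venv && zip -qr /tmp/ores_venv.zip .   # assumed venv location
    spark-submit --master yarn --deploy-mode cluster \
        --archives /tmp/ores_venv.zip#venv \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python \
        --conf spark.executorEnv.PYSPARK_PYTHON=venv/bin/python \
        my_job.py                                                  # placeholder job script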
[16:02:20] there is that weird error in the dmes [16:02:24] *dmesg [16:02:27] you didn't sign the cert? [16:02:29] that we might want to investigate [16:02:47] nono wmf-reimage should do it (or did you reimaged manually?) [16:03:27] 10Analytics-Dashiki, 06Analytics-Kanban: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3172024 (10mforns) @Milimetric Sounds good! Will try that. It seems to me like this way would work better for long-duration anomalies, as opposed to spikes, no? [16:03:40] well, i cancelled wmf reimage when we realized it was installing trusty [16:03:46] and then, beacuse it had alreday started installing [16:03:50] i couldn't use wmf-reimage [16:03:53] so i just did it manually [16:03:56] pxe boot + powercycle [16:06:02] ottomata, elukey: There are eratic spark job failures we've been missing for days [16:06:12] just found them [16:07:15] 10Analytics-Dashiki, 06Analytics-Kanban: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3172054 (10Milimetric) Hm, it might be tricky for ranges, because ranges right now are two annotations. So you'd have to do some CSS magic to get whatever you sandwich between two divs to... [16:07:31] mforns: happy to help with whatever CSS incantations you find necessary [16:08:07] joal: ? [16:08:31] ottomata: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0001741-170228165458841-oozie-oozi-C/ [16:08:34] :( [16:11:00] joal: did they timeout or osmething? [16:11:14] ottomata: couldn't actually find useful info [16:11:24] gonna hit rerun on that apr 9 20:00 one [16:11:26] see what happens [16:11:57] elukey: , home restored on stat1004 [16:12:15] elukey: do you know what that dmesg bus error means? [16:13:45] ottomata: already reran many [16:14:16] oh? [16:14:16] ottomata: no idea :/ [16:14:25] i didn't succeed reruning [16:14:48] Error: E1018 : E1018: Coord Job Rerun Error: part or all actions are not eligible to rerun! [16:15:09] weird [16:18:42] going afk a bit earlier today, will check later on IRC :) [16:21:00] [Tue Apr 11 14:54:51 2017] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [16:21:03] [Tue Apr 11 14:54:51 2017] ata1.00: NCQ Send/Recv Log not supported [16:21:06] [Tue Apr 11 14:54:51 2017] ata1.00: NCQ Send/Recv Log not supported [16:21:07] this one is also not super good [16:21:10] [Tue Apr 11 14:54:51 2017] ata1.00: configured for UDMA/133 [16:21:12] [Tue Apr 11 14:54:51 2017] ata1: EH complete [16:21:15] [Tue Apr 11 14:54:51 2017] ata1: limiting SATA link speed to 3.0 Gbps [16:22:45] ottomata: reading over the internetz seems to indicate a SATA cable to replace or power supply [16:23:14] not sure if those errors were present before though [16:23:29] we are now running linux 4.9 (we did a big jump from 3.x) [16:23:34] aye [16:23:57] joal: i don't see much in logs either [16:23:59] things like oozie.log-2017-04-11-14:2017-04-11 14:50:41,967 WARN CoordActionReadyXCommand:523 - SERVER[analytics1003.eqiad.wmnet] USER[hdfs] GROUP[-] TOKEN[] APP[restbase-coord] JOB[0001741-170228165458841-oozie-oozi-C] ACTION[] No actions to start for jobId=0001741-170228165458841-oozie-oozi-C as max concurrency reached! 
[16:24:02] but, that seems ok [16:24:49] hmm [16:24:49] org.apache.oozie.command.CommandException: E0606: Could not get lock [coord_status_transit_357f8378-3361-4ef3-bce9-50bbf326e71c], timed out [0]ms [16:24:56] ottomata: stat1004 seems fine, if you don't find anything I'll open a phab task tomorrow for Chris [16:25:24] ok [16:25:26] 06Analytics-Kanban, 13Patch-For-Review: Pagecounts all sites data issues - https://phabricator.wikimedia.org/T162157#3154040 (10ezachte) I found this in DammitSummarizeProjectviews.pl: # quick fix: fake counts for meta for a period where we had > 10 billion hits on meta due to fundraiser artefact, all to wiki... [16:25:33] its operating normally afaikt [16:25:34] ct [16:25:42] it seems so yeah [16:28:22] * elukey afk! [16:28:24] o/ [16:34:29] nuria: https://pivot.wikimedia.org/#joal-test-unique-devices-monthly/line-chart/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkc86FWs7AKVOoxZt1XTDnxEylJQaat2nbub7VBS4XPyVFy1Q+Z4ANqafibAMmIAYgA2GBgMVAAmAK4MxNq8AAoAjACaCmi+GfgR1ABKxOQA5uJKKWnFwDn5Ssxg1OYAtNnyALryA8jBNPR2xo5mvAJCouL4UqUFwABGyRAA1tRgAIKhE1oOvKUAQmubYEmp6Yf4OQDqS8zxO3saRTfATwwXNqNGN04pq4Zp [16:34:36] 5gAtZOIfIlSExXvhiMwINRyNDqgo+shyMlotElMlyKQAI7JFgSFgRdK1QLAfFEknMMltWSUqi04mkjD8fjMLasgnshn4xLUOwU9rAXpDSVYnFAA [16:34:39] aouch [16:34:39] joal: OHHHHHH [16:34:41] sorry [16:34:50] joal: do not worry, i am imagin it [16:34:53] *imaging [16:35:14] nuria: daily works great, monthly doesn't for an interestingly unexpected reason ! [16:35:27] joal: on meeting , can talk in a bit [16:35:30] sure [16:35:38] I have other things to do in the meantime [16:36:05] ottomata: I'm following a bunch of relaunch spark rest-base jobs [16:36:12] ottomata: I'll try to understand moer [16:36:28] ottomata: in the meantime, I also will add emails from those jobs, to at least get alertred [16:36:46] (03PS3) 10Joal: [WIP] Add oozie jobs loading uniques in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347611 (https://phabricator.wikimedia.org/T159471) [16:37:57] 10Analytics-Tech-community-metrics, 07Regression: Only display organizations defined in Wikimedia's DB (disable assuming orgs via hostnames in email addresses) - https://phabricator.wikimedia.org/T161308#3172186 (10Aklapper) @Albertinisg: Thanks for looking into this. Would it be possible to change the conf to... [16:40:28] (03PS1) 10Joal: Correct typos in oozie jobs for alerts emails [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347635 [16:40:35] ottomata: if you have a minute --^ [16:42:37] ottomata: all reloads I were monitoring succeeded - must be related to workers restarts or something like that, and we didn't notice because of emails typos [16:45:34] (03PS1) 10Milimetric: [WIP] Design thoughts for AQS edit history API [analytics/aqs] - 10https://gerrit.wikimedia.org/r/347637 [16:46:04] joal sorry was eating lunch [16:46:09] np ottomata [16:47:09] joal: how did you rerun the jobs? 
oozie wouldn't let me [16:47:25] ottomata: I think it's because I already did it [16:47:32] oh [16:47:33] ohhh [16:47:33] ok [16:47:41] They were actually rerunning [16:48:02] I used hue to easily follow them (now with our changes on mysql, hue is really reactivem it's great) [16:50:03] yeah [16:50:23] ottomata: CR for emails please (few lines above ;) [16:50:53] (03CR) 10Ottomata: [V: 032 C: 032] Correct typos in oozie jobs for alerts emails [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347635 (owner: 10Joal) [16:50:57] merged joal [16:50:58] Many thanks [16:51:19] Let's not forget to deploy and restart job next time - Creating a task [16:51:26] k [16:53:29] 06Analytics-Kanban: Restart oozie jobs for email alerts correction - https://phabricator.wikimedia.org/T162715#3172240 (10JAllemandou) [16:55:28] ottomata, halfak: Trying to run revscoring in spark using the repo you provided, I get: https://gist.github.com/jobar/771aa17df27501ba8541f169574bf090 [16:56:18] joal: stat1004 is back, but it doesn't have the packages yet [16:56:26] not sure if that is related [16:56:26] awesome ottomata [16:56:28] but it could be [16:56:35] ottomata: nope, running in master mode [16:56:42] ok [16:56:56] ottomata: and actually running in venv mode, so not even needing packages to be installed [16:57:16] (following the link you provided ottomata - very interesting read) [16:57:26] joal, needs a deb installed [16:57:31] * halfak gets the line in puppet [16:57:52] https://github.com/wikimedia/puppet/blob/production/modules/ores/manifests/base.pp#L10 [16:57:58] "libopenblas-dev" [16:58:10] halfak: makes complete sense [16:58:13] :) [16:58:25] yeah, makes sense, we need that class on all nodes [16:58:30] doing.. [16:58:52] ottomata, is that something we're willing to do to add all those things (see the require pacvages in puppet file halfak past) on all nodes? [16:59:19] ottomata: like virtualenv, python3-dev, etc [16:59:42] ottomata: Or maybe actually, have our workers being some ores compatible workers too ? [17:00:54] joal: back, what was issue with monthly unique devices? [17:01:11] nuria: data is in, but not really in a nice way [17:01:23] nuria: max query granularity in druid is daily [17:01:29] And in piviot, week [17:01:35] joal:oohhh [17:01:46] so you get wholes [17:01:52] joal: i see , so only daily uniques for now [17:02:05] ok works for me [17:02:08] joal: which is fine, monthly metric is not as 'informative' [17:02:19] joal: yeah totally, we can add that class everywhere [17:02:22] that is a better way to do it [17:02:43] nuria: Also for the moment, I have added estimate, offset and underestimate metrics, as well as country dimension - Is that what we want ? [17:02:58] and also site right? [17:03:00] ottomata: please be carefull, I don't know what it does ;) [17:03:07] i do! 
it just installs deb packages [17:03:14] nuria: of course host is in (forgot to mention, so obvious) [17:03:27] huhu ottomata :) Thanks :) [17:03:40] joal:ok, then , that's it , sounds great [17:04:02] nuria: just wondered if having the 3 metrics was not misleading [17:04:07] you tell me [17:04:18] joal: no, the opposite we need three [17:04:26] ok awesome [17:04:36] joal: a point i have made before is that you need to look at changes on that metric separately [17:04:47] joal: underestimate measures "repeated" visists [17:08:41] makes sense nuria [17:08:42] (03PS4) 10Joal: Add oozie job loading daily uniques in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347611 (https://phabricator.wikimedia.org/T159471) [17:08:44] nuria: --^ [17:10:27] joal: ok, class is included, give it 30 mins to run everywhere [17:11:05] ottomata: awesome !!! Man, we're closer than ever (and without all the python packages mess I had you do - Sorry again) [17:20:31] ottomata: would you mind doing a pivot restart ? It has picked up test datasources that I cleaned (not available anymore) [17:21:16] ok [17:21:50] thanks [17:22:27] way cleaner :) Thanks a lot ottomata :) [17:23:32] gr8 :) [17:26:02] 06Analytics-Kanban, 13Patch-For-Review: Add unique devices dataset to pivot - https://phabricator.wikimedia.org/T159471#3068543 (10JAllemandou) Note: Only daily uniques are imported into druid. Monthly don't work because of druid not allowing for monthly granularity queries (maximum is day). [17:44:22] 10Analytics-Tech-community-metrics: Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? - https://phabricator.wikimedia.org/T157898#3172526 (10Aklapper) >>! In T157898#3170220, @Albertinisg wrote: >> I merged https://github.com/Bitergia/mediawiki-identities/commit/50ab30725ea9d6eb03487... [17:49:39] 06Analytics-Kanban: Refactor monthly banner oozie job to use already indexed daily data - https://phabricator.wikimedia.org/T159727#3172548 (10JAllemandou) a:03JAllemandou [17:50:02] (03PS1) 10Joal: [WIP] Update banner monthly job to reuse index [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347653 (https://phabricator.wikimedia.org/T159727) [18:05:16] I'll be back in an hour or so, running errands [18:33:54] (03PS2) 10Joal: [WIP] Update banner monthly job to reuse index [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347653 (https://phabricator.wikimedia.org/T159727) [19:51:36] 06Analytics-Kanban: Make sure oozie workflows sent e-mail if they fail - https://phabricator.wikimedia.org/T162742#3173163 (10Nuria) [19:52:33] (03CR) 10Nuria: "Small comments on commit message, thanks for doing these changes." 
(031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347635 (owner: 10Joal) [21:26:51] (03PS2) 10Milimetric: [WIP] Design thoughts for AQS edit history API [analytics/aqs] - 10https://gerrit.wikimedia.org/r/347637 [21:37:00] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 3 others: Switch `/precache` to be a POST end point - https://phabricator.wikimedia.org/T162627#3173433 (10Ladsgroup) https://github.com/wiki-ai/ores/pull/192 [21:46:56] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 3 others: Switch `/precache` to be a POST end point - https://phabricator.wikimedia.org/T162627#3173436 (10Ladsgroup) https://github.com/wiki-ai/ores/pull/192 [23:36:53] (03PS3) 10Nuria: Changes internal aqs api to accept a project or array of same [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/347305 (https://phabricator.wikimedia.org/T161933) (owner: 10Fdans) [23:41:58] (03CR) 10Nuria: "@milimetric I think this is the minimun set of changes so the same getData method can take a project or an array of projects. More changes" [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/347305 (https://phabricator.wikimedia.org/T161933) (owner: 10Fdans) [23:46:48] (03PS4) 10Nuria: Changes internal aqs api to accept a project or array of same [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/347305 (owner: 10Fdans) [23:47:31] 10Analytics-Dashiki, 06Analytics-Kanban: Refactor aqs api and usage for simplicity - https://phabricator.wikimedia.org/T161933#3173805 (10Nuria) Code changes here: https://gerrit.wikimedia.org/r/#/c/347305/ [23:47:42] 10Analytics-Dashiki, 06Analytics-Kanban: Refactor aqs api and usage for simplicity - https://phabricator.wikimedia.org/T161933#3173806 (10Nuria) a:05fdans>03Nuria [23:47:55] 06Analytics-Kanban, 10Analytics-Wikistats: Visual Language for http://stats.wikimedia.org replacement - https://phabricator.wikimedia.org/T152033#3173808 (10Nuria) [23:47:57] 06Analytics-Kanban, 10Analytics-Wikistats: Visual prototype for community feedback for Wikistats 2.0 iteration 1. 
- https://phabricator.wikimedia.org/T157827#3173807 (10Nuria) 05Open>03Resolved [23:48:09] 06Analytics-Kanban: Add ability to query new AQS endpoint to node pageview API client - https://phabricator.wikimedia.org/T160655#3173809 (10Nuria) 05Open>03Resolved [23:48:11] 06Analytics-Kanban, 13Patch-For-Review: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#3173811 (10Nuria) [23:48:23] 06Analytics-Kanban, 13Patch-For-Review: Pagecounts all sites data issues - https://phabricator.wikimedia.org/T162157#3173812 (10Nuria) 05Open>03Resolved [23:48:25] 06Analytics-Kanban, 13Patch-For-Review: Populate aqs with legacy page-counts - https://phabricator.wikimedia.org/T156388#3173813 (10Nuria) [23:48:40] 06Analytics-Kanban: Add AQS's new pagecounts endpoint to mediawiki-services-restbase - https://phabricator.wikimedia.org/T161495#3173814 (10Nuria) 05Open>03Resolved [23:49:04] 06Analytics-Kanban, 13Patch-For-Review: Move reportcard to dashiki and new datasources - https://phabricator.wikimedia.org/T130117#2126320 (10Nuria) [23:49:07] 06Analytics-Kanban, 13Patch-For-Review: Populate aqs with legacy page-counts - https://phabricator.wikimedia.org/T156388#2973209 (10Nuria) 05Open>03Resolved [23:49:15] 06Analytics-Kanban: All Dashiki Dashboards down - https://phabricator.wikimedia.org/T162320#3173817 (10Nuria) 05Open>03Resolved [23:58:35] 06Analytics-Kanban: Check abnormal pageviews for XHamster - https://phabricator.wikimedia.org/T158071#3173841 (10Nuria) {F7496523}