[00:02:22] (03PS1) 10Fdans: [wip] Unifies usage of aqs api and datasets api [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/347305 (https://phabricator.wikimedia.org/T161933)
[07:43:24] joal: o/
[07:43:50] this morning I'd like to failover hdfs and yarn masters from an1001 to an1002
[07:44:06] to test a master on Debian
[07:44:31] if the test goes well the plan is to reimage 1001 when Andrew is online
[08:08:47] (03PS2) 10Fdans: [wip] Unifies usage of aqs api and datasets api [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/347305 (https://phabricator.wikimedia.org/T161933)
[08:18:42] fdans: shouldn't you be on vacation??
[08:31:37] Hi elukey !
[08:31:49] elukey: good idea on alarm names !!!
[08:31:55] elukey: please, go with failover :)
[08:38:47] * elukey proceeds!
[08:44:06] 10Analytics-Tech-community-metrics: Provide equivalent of "SCR: Code review users vs. Code review committers" in Kibana - https://phabricator.wikimedia.org/T151558#3167639 (10Aklapper) p:05Normal>03Low
[08:46:01] Resource manager failed over to an1002
[08:47:06] and also NameNode
[08:48:28] 10Analytics-Tech-community-metrics: Provide equivalent of "SCR: Open changesets vs. Open changesets waiting for review (CR0 / CR+1)" in Kibana - https://phabricator.wikimedia.org/T151555#3167650 (10Aklapper) p:05Normal>03Low
[08:48:45] 10Analytics-Tech-community-metrics: Provide equivalent of "SCR: Age of open changesets (monthly snapshots)" in Kibana - https://phabricator.wikimedia.org/T151557#3167653 (10Aklapper) p:05Normal>03Low
[08:48:56] * joal watches
[08:56:15] joal: yarn.w.o should work now, I applied a manual change to the apache config on thorium
[08:56:33] elukey: it does indeed work :)
[08:56:59] elukey: my long running job (Streaming Banners) has not failed
[08:57:01] :)
[09:03:38] metrics look good and nothing seems on fire
[09:03:40] goooooood
[09:04:07] I'll let it run this way for the next 2/3 hours just to be sure
[09:04:14] and then I'll reimage an1001 with Andrew
[09:04:32] awesome
[09:39:58] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3167774 (10Addshore) So to be done 'properly' this will require an oozie job to be created for the hadoop -> graphite data. A vague example can be seen @ https://githu...
[11:49:11] * elukey lunch!
[12:56:26] joal: I'd need to reboot some Hadoop workers to let them pick up the new kernel
[12:56:36] so probably I'll cause some disturbance in the force
[12:56:56] elukey: I could have felt it ;)
[12:57:41] elukey: if you don't mind, can you wait for me to go on break ?
[12:57:47] like in 10 mins or so
[12:57:57] currently playing with a spark-shell
[12:59:29] sure! I'd need to do 1040->1050
[12:59:41] not really urgent though
[12:59:42] elukey: no prob once I'm done ;)
[13:07:30] elukey: please go for it, my stuff broke ;)
[13:08:15] ottomata: I'm leaving for a break now -- sweet reminder about python packages for spark ;)
[13:11:08] oh joal that's done!
[13:11:33] ottomata: o/
[13:12:21] an1002 has been working as primary since this morning EU time
[13:12:24] all good for the moment
[13:12:45] great
[13:13:10] whenever you are ready to go we can reimage an1001
[13:18:51] ok so starting with the hadoop worker reboots (1040->1050, new kernel)
[13:20:39] you have to reboot again!?
[13:20:47] elukey: we were going to do an01 tomorrow ya
[13:20:48] ?
[13:20:54] ottomata: I thought today!
[13:21:09] either way!
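For context on what a failover like the above involves: with Hadoop HA it is usually driven from the admin CLIs, roughly as sketched below. This is illustrative only; the serviceIds (analytics1001-eqiad-wmnet / analytics1002-eqiad-wmnet) are assumptions about how the local HA configuration names the masters, not confirmed values.

    # check which NameNode is active, then initiate a graceful failover
    sudo -u hdfs hdfs haadmin -getServiceState analytics1001-eqiad-wmnet
    sudo -u hdfs hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet
    # the YARN ResourceManager has its own HA admin tool
    sudo -u yarn yarn rmadmin -getServiceState analytics1002-eqiad-wmnet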
[13:21:11] but we can do it tomorrow no prob
[13:21:12] we can do it today
[13:21:13] haha
[13:21:26] lemme get through emails and then we'll see
[13:22:15] sure sure
[13:22:25] so 1040->1050 were installed with 4.4 :(
[13:22:33] and the rest is running 4.9
[13:25:39] ah ok
[13:27:46] sudo cumin 'analytics10[41-50]*' 'disable-puppet' -t 10
[13:27:46] 10 hosts will be targeted:
[13:27:46] analytics[1041-1050].eqiad.wmnet
[13:27:51] magic
[13:28:01] -t 10
[13:28:07] is that a safety limit?
[13:28:27] nono timeout
[13:49:05] hi team :]
[13:50:44] hiii
[13:51:07] mforns: welcome back!
[13:52:02] elukey: let's do an01 today. you wanna drive?
[13:53:22] ottomata: sure.. we can chat in here without hangouts, not prov
[13:53:24] *prob
[13:53:48] so you backed up /var/lib/hadoop/etc.. somewhere?
[13:53:54] I don't remember
[13:54:13] ya
[13:54:23] so
[13:54:29] i made an rsyncd.conf file
[13:54:35] you can use the one i made in my home on an02
[13:54:38] then
[13:54:51] sudo rsync --daemon --no-detach --config rsyncd.conf (or something like that)
[13:54:57] (after stopping ferm)
[13:55:03] then, after stopping namenode, etc.
[13:55:10] i just rsynced the dir to stat1004
[13:55:14] just to have a backup copy
[13:55:17] just in case
[13:56:17] and then stopped daemons, rebooted etc.. and kept partitions
[13:56:37] yup
[13:56:53] also did hyperthreading on reboot
[13:57:13] oh right writing a note
[13:58:23] waiting for an1041-1043 to come up after reboot
[13:58:27] then I'll start
[13:58:49] k
[14:07:05] elukey@analytics1001:~$ sudo service hadoop-
[14:07:05] hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-httpfs hadoop-mapreduce-historyserver hadoop-yarn-resourcemanager
[14:07:09] ottomata: --^
[14:07:15] we may have a complication
[14:07:25] hadoop-httpfs and hadoop-mapreduce-historyserver
[14:07:52] they are not running on an1002
[14:08:09] httpfs might not be that important
[14:08:14] but hadoop-mapreduce-historyserver probably is
[14:08:55] mmm from https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html it seems like only a user-facing thing
[14:08:58] so nothing used internall
[14:09:03] *internally
[14:09:08] okok might be fine
[14:09:54] yea i think its fine
[14:10:05] both of those are nice services for users
[14:10:18] hue might use httpfs, not sure
[14:10:20] but it won't break
[14:10:27] you just wouldn't be able to browse the filesystem
[14:11:17] yep yep
[14:11:49] ottomata: do you mind backing up /var/lib/hadoop/name? It would take you less time with your rsync foo :)
[14:12:52] can do
[14:13:27] haha oh right the rsyncd.conf file isn't in my home on an02..we reimaged it :)
[14:15:00] elukey: let me know when I can, we need namenode stopped first
[14:15:01] I feel a bit ashamed but I always need 10/15 minutes to get up to speed with rsync each time that I use it :/
[14:15:08] already done!
[14:15:11] you can proceed
[14:15:17] i just copy / paste and edit from the one on stat1002 or wherever
[14:15:19] k!
[14:17:04] elukey: done
[14:18:09] gooood! so I'll reboot and enable ht then ottomata
[14:18:10] ok?
[14:20:28] ya proceed!
[14:25:37] same weird error on saving when enabling ht
[14:29:14] hm elukey i guess we didn't actually check that ht was enabled when an02 came back up
[14:29:26] i see 32 procs though
[14:29:28] so i guess so!
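A rough reconstruction of the backup procedure described above; the module name, uid and destination path are assumptions, not the actual rsyncd.conf that was used.

    # on analytics1001: export the NameNode metadata dir read-only over rsyncd
    cat > rsyncd.conf <<'EOF'
    [namenode-backup]
        path = /var/lib/hadoop/name
        read only = yes
        uid = root
    EOF
    sudo service ferm stop                            # open the rsync port temporarily
    sudo rsync --daemon --no-detach --config rsyncd.conf
    # on stat1004, while the NameNode is stopped: pull a copy of the metadata
    rsync -av rsync://analytics1001.eqiad.wmnet/namenode-backup/ ~/an1001-name-backup/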
[14:29:46] oh yes I checked, it is enabled
[14:32:19] ottomata: just to be sure, you may want to log in to an1001's console to check the partition settings
[14:32:35] I've set the /var/lib/hadoop/name mountpoint and root
[14:32:38] ext3 and ext4
[14:33:00] (connect to the console and move the arrows, it's already at the partitioning screen)
[14:36:06] ottomata: ?
[14:37:15] going to proceed then, we have backups :P
[14:37:33] elukey: sorry
[14:37:33] ya
[14:37:36] proceed i trust ya!
[14:37:41] we have backups :)
[14:38:00] I don't trust myself! :P
[14:46:45] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-Vagrant, 06Services (watching): Vagrant git-update error for event logging - https://phabricator.wikimedia.org/T161935#3168506 (10Ottomata) Hm, dunno what is up with vagrant git-update in general. EventLogging seemed to update just fine for me: ``` ==> U...
[14:47:07] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-Vagrant, 06Services (watching): Vagrant git-update error for event logging - https://phabricator.wikimedia.org/T161935#3168509 (10Ottomata) Maybe gerrit was just unresponsive when you ran this @Pchelolo ?
[14:51:57] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Write Spark schema differ / Hive DDL generator - https://phabricator.wikimedia.org/T161924#3168519 (10Ottomata) a:03JAllemandou
[14:53:11] 10Analytics-EventLogging, 06Analytics-Kanban, 13Patch-For-Review: Write Spark schema differ / Hive DDL generator - https://phabricator.wikimedia.org/T161924#3147352 (10Ottomata) I moved this to done, since Joseph wrote a good prototype for this. This helped me with T153328. We'll make a new ticket to track...
[14:55:32] 10Analytics, 10Analytics-EventLogging: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610#3168526 (10Ottomata)
[14:55:46] 10Analytics, 10Analytics-EventLogging: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610#3168544 (10Ottomata)
[14:55:48] 10Analytics-EventLogging, 06Analytics-Kanban: Research Spike: Better support for Eventlogging data on hive - https://phabricator.wikimedia.org/T153328#2877030 (10Ottomata)
[15:00:33] elukey, mforns , joal: standddupppppp
[15:01:25] PROBLEM - YARN NodeManager Node-State on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:26] PROBLEM - YARN NodeManager Node-State on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:27] PROBLEM - YARN NodeManager Node-State on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:28] PROBLEM - YARN NodeManager Node-State on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:31] PROBLEM - YARN NodeManager Node-State on analytics1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:31] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:32] PROBLEM - YARN NodeManager Node-State on analytics1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:33] PROBLEM - YARN NodeManager Node-State on analytics1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:34] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:36] PROBLEM - YARN NodeManager Node-State on analytics1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:37] PROBLEM - YARN NodeManager Node-State on analytics1043 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:38] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:40] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:41] PROBLEM - YARN NodeManager Node-State on analytics1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:42] PROBLEM - YARN NodeManager Node-State on analytics1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:43] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:44] PROBLEM - YARN NodeManager Node-State on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:45] PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:47] PROBLEM - YARN NodeManager Node-State on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:48] wow
[15:01:48] PROBLEM - YARN NodeManager Node-State on analytics1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:49] PROBLEM - YARN NodeManager Node-State on analytics1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:50] PROBLEM - YARN NodeManager Node-State on analytics1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:51] PROBLEM - YARN NodeManager Node-State on analytics1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:02:12] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:02:42] elukey: ooooo
[15:02:48] nuria, coming!
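When a flood like the above hits, YARN's own view of the NodeManagers is the quickest sanity check; a sketch, runnable from any Hadoop client node:

    # list NodeManagers and their states as the ResourceManager sees them
    yarn node -list -all
    # or just count how many report RUNNING
    yarn node -list -all | grep -c RUNNING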
[15:03:33] RECOVERY - YARN NodeManager Node-State on analytics1045 is OK: OK: YARN NodeManager analytics1045.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:34] RECOVERY - YARN NodeManager Node-State on analytics1048 is OK: OK: YARN NodeManager analytics1048.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:35] RECOVERY - YARN NodeManager Node-State on analytics1046 is OK: OK: YARN NodeManager analytics1046.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:36] RECOVERY - YARN NodeManager Node-State on analytics1056 is OK: OK: YARN NodeManager analytics1056.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:37] RECOVERY - YARN NodeManager Node-State on analytics1037 is OK: OK: YARN NodeManager analytics1037.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:39] RECOVERY - YARN NodeManager Node-State on analytics1040 is OK: OK: YARN NodeManager analytics1040.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:40] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:41] RECOVERY - YARN NodeManager Node-State on analytics1042 is OK: OK: YARN NodeManager analytics1042.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:42] RECOVERY - YARN NodeManager Node-State on analytics1043 is OK: OK: YARN NodeManager analytics1043.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:43] what the hell
[15:03:43] RECOVERY - YARN NodeManager Node-State on analytics1044 is OK: OK: YARN NodeManager analytics1044.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:45] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:46] RECOVERY - YARN NodeManager Node-State on analytics1054 is OK: OK: YARN NodeManager analytics1054.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:47] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:48] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:49] RECOVERY - YARN NodeManager Node-State on analytics1041 is OK: OK: YARN NodeManager analytics1041.eqiad.wmnet:8041 Node-State: RUNNING
[15:03:50] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:01] RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:02] RECOVERY - YARN NodeManager Node-State on analytics1055 is OK: OK: YARN NodeManager analytics1055.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:04] RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:05] RECOVERY - YARN NodeManager Node-State on analytics1057 is OK: OK: YARN NodeManager analytics1057.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:06] RECOVERY - YARN NodeManager Node-State on analytics1053 is OK: OK: YARN NodeManager analytics1053.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:07] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:08] RECOVERY - YARN NodeManager Node-State on analytics1036 is OK: OK: YARN NodeManager analytics1036.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:21] elukey: the services stayed up
[15:04:22] RECOVERY - YARN NodeManager Node-State on analytics1050 is OK: OK: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: RUNNING
[15:04:30] so it must be a temporary networking or icinga problem
[15:05:10] nuria: sorry will complete testing before joining :(
[15:05:38] so only yarn
[15:27:22] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 3 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3168638 (10mobrovac) I agree that the minimum that should be done here is to switch to POST-style requests. Would you be availabl...
[15:29:11] 10Analytics: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3168642 (10Nuria)
[15:34:06] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 3 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3168675 (10Ladsgroup) Yeah, Just please make a phab card (a subtask of this) and assign it to me. I'll get it done ASAP.
[15:35:46] 06Analytics-Kanban: Add unique devices dataset to pivot - https://phabricator.wikimedia.org/T159471#3168679 (10JAllemandou) a:03JAllemandou
[15:37:41] 10Analytics: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3168685 (10Nuria) There are two steps: * purging data from master based on policy * replicating data deletes to slaves.
[15:52:12] 10Analytics: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3168697 (10Nuria)
[15:58:07] 10Analytics: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3168716 (10Nuria) Also, please document the option chosen
[16:01:57] ottomata: puppet is failing on all the workers
[16:01:58] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: undefined method `function_create_resources' for nil:NilClass at /etc/puppet/modules/role/manifests/analytics_cluster/hadoop/worker.pp:120 on node analytics1032.eqiad.wmnet
[16:02:51] 06Analytics-Kanban: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#2990326 (10Nuria)
[16:06:55] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#1835400 (10Nuria) It is not clear what is the value of this data, can someone explain?
[16:07:32] ottomata: ahhh checked the file, it seems related to your last merge
[16:08:31] https://gerrit.wikimedia.org/r/#/c/346812/
[16:10:36] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#2783668 (10Nuria) @kaldari: it doesn't seem like eventlogging is the best fit for this If you...
[16:12:30] ottomata: Just checked, seems that various workers (an1044 -> an1050 included) have puppet disabled :(
[16:12:40] joal: puppet is failing
[16:12:53] I think probably due to the patch
[16:13:00] (posted a bit above)
[16:13:35] 10Analytics, 10Analytics-Dashiki: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3164702 (10Milimetric) p:05Triage>03Normal
[16:13:54] maybe some pkg is not right?
[16:14:11] hm - only ottomata knows
[16:14:46] 10Analytics, 10Analytics-Cluster: Refactor webrequest_source partitions and oozie jobs - https://phabricator.wikimedia.org/T116387#3168743 (10Nuria) p:05Low>03Normal
[16:15:15] 10Analytics, 10Analytics-Dashiki: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3164702 (10Milimetric) Just a quick FYI: here's a bit better way to show annotations, just modifying the style: {F7459203} It's just an example, it should be thought through, with probably hove...
[16:15:18] he always knows
[16:15:20] :)
[16:15:44] it's not only that he always knows, it's about him being the ONLY one !
[16:15:47] elukey: --^
[16:16:09] 10Analytics, 10Analytics-Dashiki: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3168747 (10Nuria) p:05Normal>03High
[16:17:50] 10Analytics: Hand off of Christian's MaxMind geolocation databases repository - https://phabricator.wikimedia.org/T89453#1036629 (10Nuria) We were thinking about geo coding anonymous edits through history (we have those from 2014). We do not think we need this snapshot after.
[16:19:35] 06Analytics-Kanban: Verify MaxMind is updated regularly - https://phabricator.wikimedia.org/T162616#3168750 (10Nuria)
[16:19:44] 06Analytics-Kanban: Verify MaxMind is updated regularly - https://phabricator.wikimedia.org/T162616#3168766 (10Nuria) p:05Triage>03High
[16:23:44] there are packages with multiple versions, might be the issue?
[16:25:40] elukey: on maxmind?
[16:26:02] nuria: nono python packages on apt :)
[16:26:10] elukey: oohhh
[16:31:51] goin to lunch, will take a look at MaxMind after
[16:33:09] nuria: here?
[16:33:17] yes, sorry, omw
[16:33:17] I am going to revert Andrew's change to see if the puppet failures are indeed due to that
[16:33:27] sure elukey
[16:36:11] joal: yeah it fixes the puppet runs
[16:45:09] sent an email to Andrew, I think he has connectivity issues
[16:45:53] 10Analytics: Measure portal pageviews (wikimedia.org) - https://phabricator.wikimedia.org/T162618#3168814 (10Nuria)
[16:57:26] so an1001 seems to be running fine on Debian
[16:57:47] I'm not surprised elukey - Everybody is running fine on debian ;)
[16:57:59] joal: I am still a bit puzzled about the yarn alarms
[16:58:06] yeah, weirdoh :(
[16:58:45] the timing is super weird, since it affected all the worker nodes at the time that I was reimaging an1001
[16:58:51] but from the logs I don't see much
[16:59:06] and we haven't observed any failure at job level afaics
[16:59:28] an1001 was running the mapreduce history server and the http webserver
[16:59:55] and an1002 is not
[17:00:06] so having them down might have affected yarn node managers?
[17:07:59] joal: i'm starting to write up pieces of our machine learning data processing pipeline on pyspark as patches to gerrit, could i add you to review general spark stuff for sanity check?
[17:08:32] please ebernhardson
[17:08:35] joal: thanks!
[17:08:56] ebernhardson: I'm not so good with pyspark, I prefer scala, but it's nice to include me in the pipeline :)
[17:09:51] joal: we pondered java vs scala vs python, went with python because we have the most familiarity there. I think the spark api is pretty much the same between them, and things like caching/rdd's/sql/etc should be the same across languages
[17:19:15] ebernhardson: it's not that different for sure, I should be able to understand it at read :)
[17:20:16] elukey: about the alarms: what is the NRPE check?
[17:24:50] joal: it is nagios trying to check if the yarn nodemanager is running
[17:25:00] we have a custom script that should return RUNNING
[17:25:08] maybe for a blip that one returned another STATE
[17:25:54] elukey: Do you know if the URL nagios checks uses yarn UI proxying?
[17:32:10] ??
[17:35:17] elukey: I have problems with it currently, so I wondered
[17:36:51] joal: sorry I didn't get what you meant
[17:36:57] are you having issues with yarn.w.o ?
[17:37:25] Yarn UI itself works, but I can't have it proxying to other nodes
[17:37:28] https://yarn.wikimedia.org/proxy/application_1491813716361_0669/
[17:38:37] joal: fixing it, now that an1001 is up again
[17:39:21] is it only that one?
[17:39:56] elukey: only error I've seen so far
[17:40:03] I think that the current apache config is not working super well with an1002 as primary
[17:40:09] k
[17:42:26] elukey: i don't understand that puppet error you saw
[17:42:36] not sure how it could be related to my change, the packages install fine, and also https://puppet-compiler.wmflabs.org/6070/
[17:43:12] ottomata: I checked the line that was erroring out and it was the require_package with all the python packages
[17:43:17] no idea why it was failing
[17:43:28] yeah
[17:43:31] but it started with the first puppet runs after your merge
[17:43:35] makes sense
[17:43:51] didn't find you online so I thought to try and revert to check :(
[17:43:53] elukey: i'm going to re-merge it but i'll watch it better and see
[17:43:54] np
[17:44:10] well it will probably fail again no?
[17:44:23] maybe but do you have the whole error message? it doesn't make any sense and i can't test it
[17:44:27] installing the packages manually works fine
[17:44:46] that one was the full error :D
[17:45:01] maybe the require_package function does something weird
[17:45:13] that fails for some package versions etc..
[17:45:14] maybe it has a limit to the number of packages it can take?
[17:45:20] dunno :(
[17:45:44] ottomata: would you mind checking an1001 and seeing if it looks ok?
[17:47:38] elukey: looks good to me
[17:50:24] ottomata: shall we restore masters to an1001?
[17:50:33] ya!
[17:50:41] all right doing it!
[17:50:46] joal: --^
[17:50:56] elukey: k !
[17:54:13] 1001 is back the masterz
[17:56:13] joal: https://yarn.wikimedia.org/cluster/app/application_1491813716361_0669 works now :)
[17:56:32] Thanks !
[17:57:12] so masters completed!! \o/
[17:57:25] YAY !!! BEERS !
[18:01:02] clink clink (virtual glasses going 'cheers')
[18:09:50] all right logging off
[18:10:12] kudos to Andrew who is resolving the python dependencies mess for the new packages
[18:10:21] haha
[18:10:29] elukey: i think the main problem was obvious but i didn't see it
[18:10:34] require_package doesn't work with version numbers
[18:10:41] lol
[18:11:04] but now there's a new one i don't fully understand, but i think it has to do with using a package resource to include a package and then using require_package for one that depends on the first package
[18:11:07] i think i almost got it though
[18:11:33] laters elukey! nice job on an01
[18:11:45] oh elukey puppet disabled on an44
[18:11:45] ?
[18:11:48] is that on purpose?
[18:12:28] ottomata: ah no it is when I rebooted!
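The custom check discussed above boils down to asking YARN for the node's state and mapping it to a nagios exit code. A hedged sketch only -- the real script lives in puppet and its exact logic is assumed here:

    #!/bin/bash
    # NRPE-style check: report this NodeManager's state, expecting RUNNING
    HOST=$(hostname -f)
    STATE=$(yarn node -list -all 2>/dev/null | awk -v h="$HOST" '$1 ~ h { print $2 }')
    if [ "$STATE" = "RUNNING" ]; then
        echo "OK: YARN NodeManager ${HOST}:8041 Node-State: RUNNING"
        exit 0
    else
        echo "CRITICAL: YARN NodeManager ${HOST}:8041 Node-State: ${STATE:-unknown}"
        exit 2
    fi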
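On the require_package discovery above, a minimal illustration of the difference; the package name and version are hypothetical:

    # In puppet, require_package() takes bare package names only:
    #   require_package('python-foo')           # fine
    #   require_package('python-foo=1.2-1')     # breaks catalog compilation
    # A pinned version belongs on a package resource instead:
    #   package { 'python-foo': ensure => '1.2-1' }
    # apt itself is happy with pinned versions, which is why manual installs worked:
    sudo apt-get install -y python-foo=1.2-1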
[18:12:33] taking care of it sorry
[18:13:15] np
[18:13:22] ah snap 1044 wasn't rebooted
[18:13:57] will finish the reboots tomorrow then
[18:14:46] k
[18:14:57] it's so weird that puppet compiler can't catch these package deps
[18:15:02] i guess it can't know until it runs apt-get stuff
[18:19:49] yeah
[18:20:04] it is really dumb and suffers from a lot of other little bugs
[18:20:51] * elukey runs afk! byyyyeee o/
[18:22:02] joal: you wanna make a ticket or just email me about your nltk data thing? telling me what you want to have installed and where?
[18:23:44] phew also finally got those python packages happy
[18:23:51] give it 30 and it'll be everywhere
[18:36:26] Thanks ottomata :)
[18:36:31] emailing you
[18:38:48] nuria: if you want to have a look: https://wikitech.wikimedia.org/wiki/User:Joal/FBProphet
[18:40:36] joal: Super great
[18:42:35] cool
[18:43:29] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3169276 (10kaldari) @Nuria: I don't see anything in the documentation for the Data Lake indica...
[18:45:57] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3169284 (10kaldari) @Neil_P._Quinn_WMF: Would you be able to use the [[ https://wikitech.wikim...
[18:47:35] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3169290 (10Nuria) More specific docs: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/...
[18:48:09] nuria: Thanks, I left that comment before fully reading through the docs :P
[18:48:59] kaldari: ok, i think most of the info is ready for you to use it, i know we are planning on adding more detailed info to the user table too
[18:49:15] kaldari: all this is in hive so it is clear
[18:49:38] nuria: Do you think it would be possible to see how many edits a user had at the time of article creation using the data lake data?
[18:49:49] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#2783668 (10Ottomata) If you do need this data in an event based form, it sounds like EventBus...
[18:50:19] a-team: weird suspended jobs in oozie (see emails) - currently trying to restart them
[18:50:30] kaldari: i think that is one field we have a request to add
[18:51:22] nuria: any idea on a timeframe for that? This is for a high-priority (i.e. emergency) request that has been open since October :(
[18:52:18] kaldari: we will probably have those by end of quarter but note, you can get all info but that and just do an additional query for edit counts
[18:52:48] nuria: I assume the emails we have on uniques-daily are from your code - Can you confirm?
[18:53:00] need to go for dinner a-team, will come and double check stuff after
[18:53:03] joal: yes they are SORRY
[18:54:33] kaldari: so i would get all info but edit counts from data lake and once you have a list of users you can get the additional data. the ticket: https://phabricator.wikimedia.org/T161147
[18:54:43] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 3 others: Switch `/precache` to be a POST end point - https://phabricator.wikimedia.org/T162627#3169314 (10mobrovac)
[18:58:07] nuria / kaldari: we can escalate the priority of that request, I know Neil said it's tricky or hard to get the edit count at the time of the edit manually
[18:59:10] milimetric: yes, seems like it would need an intermediate computation step
[18:59:31] keep us in the loop kaldari, and we can help. I see the request was only tagged Analytics last Friday, so we wouldn't have seen it until then.
[18:59:34] but kaldari's data request is (i think) all on data lake but that one data point
[18:59:39] yep
[19:00:04] that data point is relatively easy for us to compute as part of user_history reconstruction
[19:00:08] nuria: Sounds like we still need to query against the revision and archive tables then, which is doable, but not much better than what we can do without the data lake.
[19:00:40] kaldari: no, you can query against the data lake itself
[19:00:56] kaldari: the advantage is you can do all that in Hadoop / Hive / paws internal. Because we have revision and archive across all wikis imported in there too
[19:01:05] OK
[19:01:10] so it's dramatically easier and faster than mysql
[19:01:17] oh good :)
[19:01:28] I'll let Neil know
[19:01:57] kaldari: so, while it is not 1 stop
[19:02:00] yeah, and really, we can add that field if it's a major problem for him. Like if he thinks it'll take him longer than a day to work around that problem, I'll just add the field
[19:02:11] cc neilpquinn
[19:02:39] kaldari: it doesn't require you to query the mediawiki db
[19:03:31] nuria: Oh yeah, I meant the mirrors of those tables in the data lake. But I guess it's still going to be a lot faster than querying against the raw mediawiki tables.
[19:03:46] kaldari: as in gazillions
[19:03:54] sweet
[19:04:07] thanks for the info, gotta run to lunch
[19:06:44] milimetric: I'm rusty on the details, but I think we need both the redirect field and the edit count field for that project (new article creation analysis).
[19:08:32] neilpquinn: it says "whether or not the page is a redirect", and we have page_is_redirect_latest. If you mean you need to know whether the page was a redirect at the time of the edit, nobody has that right now, we'd need to mine wikitext for it
[19:09:26] so the main question is: is it too much of a pain for you to get the edit count? If so, then what's your deadline?
[19:11:05] (03CR) 10Nuria: [V: 032 C: 032] Correcting loading of pagecounts into cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/346802 (https://phabricator.wikimedia.org/T162157) (owner: 10Nuria)
[19:11:05] not sure how important the historical redirect info would be specifically. Would have to think about that and sadly I won't have much time for analysis over the next few months.
[19:12:07] milimetric: I'm not sure what the deadline is, I haven't worked on that task in a while. kaldari is the expert.
[19:13:00] milimetric: unfortunately, I won't have much time for analysis for the next 3-4 months. I'm project managing https://meta.wikimedia.org/wiki/New_Editor_Experiences
[19:13:45] cool project, is someone else going to help kaldari? Just let us know who to talk to and we'll try and make the data work for them
[19:14:43] milimetric: the deadline was some time last year, lol. The real deadline is before the English Wikipedia community restricts new page creation to autoconfirmed users, which they are still threatening to do.
[19:15:37] kaldari: ok, and who's going to work on it, just you?
[19:16:31] heh, sorry, didn't mean "just" like in any bad way :)
[19:16:43] I meant is anyone helping you?
[19:17:02] Neil Quinn was helping
[19:18:13] (03PS2) 10Nuria: Parqued Code - Asiacell modifications on unique devices [analytics/refinery] - 10https://gerrit.wikimedia.org/r/341835 (https://phabricator.wikimedia.org/T158237)
[19:20:14] kaldari: that ticket doesn't talk about edit count though, is that somehow related to 'autoconfirmed' users? (asking from the IGNORANCE)
[19:21:11] autoconfirmed = 10 edits and at least a week old
[19:22:40] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3169411 (10Legoktm) To get an idea of how many people are using the MediaWiki tarballs. Additionally, this data could be used to see if people are still downloading old...
[19:23:18] ok, so nobody's helping any more and you need this yesterday. Um... nuria I think the best we can do is add the edit_count field and show a sample query.
[19:34:20] milimetric, kaldari : but let's put this in perspective with how many requests we have
[19:35:37] nuria: I agree. We could task it and see what this would have to displace if we did it
[19:36:10] milimetric: but data can be gathered now, correct? It is just a bit more cumbersome
[19:36:55] milimetric: this request has been outstanding since October, it cannot be an emergency on our end when we learn about it today
[19:37:04] nuria: it can in theory, but kaldari doesn't have any analyst helping him, and the edit_count was a bit tricky
[19:37:29] it would be much more realistic for him to complete the task with an edit_count field, and I know it's of high value to others as well
[19:37:43] nuria: no, definitely we can't treat this as an emergency
[19:37:54] I'm proposing we task it on Thursday and decide then
[19:38:26] and I'm only saying that because I know joseph wanted to add the edit_count field too, it was something that was bothering us for a while, and we de-prioritized it until we productionized the rest of the job
[19:39:12] milimetric: we already looked at the edit count request and triaged it for July in light of how much we had to do this quarter; we agreed July was the timeframe we could tackle it
[19:39:59] right, but we can re-evaluate that in light of new information
[19:40:48] I'm not saying we should go out of process for it, agreed with you there
[19:44:00] milimetric: we can revisit the decision, sure, but an urgent data request from October should not be the driver. If the data request is urgent someone needs to start gathering that data today and not wait for our changes. If, that is, gathering that data is doable but cumbersome (and sounds like it is) cc kaldari
[19:45:04] 06Analytics-Kanban: Announce analytics.wikimedia.org/datasets and deprecation of datasets.wikimedia.org - https://phabricator.wikimedia.org/T159409#3169528 (10Milimetric) Latest thinking on this: 1. Add a redirect from https://datasets.wikimedia.org to https://analytics.wikimedia.org/datasets/archive. 2. Move a...
[19:45:42] milimetric, nuria: just read the thread
[19:45:46] nuria + kaldari: right, we have really good data to support this task, literally thousands of times better than in the past
[19:46:02] so it should be doable if we can get people to work on it
[19:46:29] but separately, I think it'd be good if we took a quick look at adding the edit_count field, it is a thorn in my side
[19:46:32] milimetric, nuria: Looks like it's a 1-off that needs user-edits-count at time of edit, and is_redirect for the given edit, correct?
[19:46:48] joal: yep, and we have is_redirect
[19:47:00] milimetric: is_redirect_latest, right
[19:47:11] right, confirmed that's all that's needed
[19:47:18] (is_redirect_latest)
[19:47:44] milimetric: really? Nice - I would have expected it to be not enough for historical analysis
[19:48:05] well, it'd be awesome if we had a proper is_redirect, but nobody has that now, literally nobody
[19:48:18] we'll get that later when we parse wikitext
[19:48:39] so for now _latest works just fine
[19:48:43] milimetric: 4 of the new fields we are talking about require the same type of addition to the denormalisation job (cumulative edit-counts per page/user, and time between edits per user/page)
[19:49:10] you're thinking it'd be good to lump them together?
[19:49:12] milimetric: for a 1-off, and for enwiki only, I can have that soon
[19:49:23] nah, no need for that
[19:49:42] correct milimetric, if we add 1 of the fields, adding the other 3 is so similar it'd be dumb not to
[19:49:54] I would only be for this if it improves the infrastructure, not to serve the immediate request only
[19:50:08] makes sense to me
[19:50:18] but yeah, I think we should re-prioritize on Thursday
[19:50:31] milimetric: I have text in parquet for enwiki, I can reasonably easily parse it and extract redirect (at least an easy format of it)
[19:50:33] nono, that's definitely not needed
[19:50:41] milimetric: let's separate concerns here, right ;)
[19:50:51] zero analysis has ever needed that so far, it'll be new data when you do it, save some goodies for later :)
[19:50:59] :D
[19:51:43] nuria: saw your email in the jobs emails - Some jobs were in SUSPEND mode (I have no idea how they ended up there)
[19:51:51] nuria: I resumed them
[19:52:00] joal: and also i do not know what to do further, i can see:
[19:52:18] joal: for the suspended job
[19:52:18] also nuria, make sure you differentiate between SLA errors (job is taking too long) and job errors
[19:52:21] https://www.irccloud.com/pastebin/IxZURrbz/
[19:52:45] nuria: which job is that?
[19:53:06] joal: the suspended one, full screen
[19:53:10] https://www.irccloud.com/pastebin/Q3QBmg4i/
[19:53:19] Error is JA006
[19:53:48] nuria: I resumed coordinators, that's weird some workflows are still suspended !
[19:53:53] joal: so error is JA006
[19:53:59] nuria: for resuming: oozie job --resume JOBID
[19:54:00] joal: did you re-start?
[19:54:18] nuria: suspend should only need resume, not rerun
[19:54:30] ah, and how do you do that?
[19:54:37] ^ joal
[19:54:56] how would I ssh to analytics1011.eqiad.wmnet? Or check its /var/log ? Or any other worker node?
[19:55:20] Again nuria: for resuming: oozie job --resume JOBID
[19:55:48] nuria: I now see plenty of suspended workflows
[19:55:59] nuria: do you want me to resume them or do you go for it?
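For the record, a CLI equivalent of those resume clicks might look like the sketch below; the job id shown is hypothetical.

    # list suspended workflows, then resume each one
    oozie jobs -jobtype wf -filter status=SUSPENDED -len 50
    oozie job -resume 0012345-170410123456789-oozie-oozi-W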
[19:55:59] joal: ya, i saw those too
[19:56:14] joal: not sure if you can filter for those fast
[19:56:30] 7 of them, I'll manage :)
[19:57:25] kaldari: the sum-up of our internal discussion is that you should start gathering that data now, from the data lake. Even if we add the edit count fields earlier than we thought, they will not be done as fast as you might need them
[19:57:28] actually nuria, hue works awesome to do it
[19:57:48] nuria: resumed everything in 3 clicks
[19:58:02] joal: all right
[19:59:12] nuria: Thanks for having checked the workflows, I'd have missed that
[19:59:41] np, i do not think i did very much; sent it to the list cause i had not seen SUSPENDED before though
[20:00:03] nuria, kaldari: We can provide help for functions to gather cumulative edit counts (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics)
[20:01:50] yeehaw, +10 nodes! 2 are giving me trouble
[20:01:54] but we are up a buncha worker nodes now woooo
[20:02:06] ottomata: That is fantastic news :)
[20:07:03] nice!
[20:11:06] 06Analytics-Kanban: Verify MaxMind is updated regularly - https://phabricator.wikimedia.org/T162616#3168750 (10Milimetric) Findings ======= GeoIP update is puppetized here: https://github.com/wikimedia/puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/geoip/manifests/data/maxmind.pp#L77 GeoIP also s...
[20:11:20] 06Analytics-Kanban: Verify MaxMind is updated regularly - https://phabricator.wikimedia.org/T162616#3168750 (10Milimetric) a:03Milimetric
[20:15:17] ottomata: 'nother ask for you on packages: can you add stat100[2,4] to the list for the python things?
[20:15:26] oh for those ones?
[20:15:27] ya
[20:16:31] ottomata: also, what was the error on the workers about the new packages list?
[20:16:55] puppet dumbness
[20:17:00] ok :)
[20:17:15] packages were fine
[20:17:16] :)
[20:17:22] right
[20:24:06] leaving for today a-team, will see ya tomorrow :)
[20:25:00] laters joal
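To make the windowing suggestion above concrete, a sketch of a cumulative per-user edit count at the time of each edit. The table and column names are assumptions based on this discussion, not a confirmed Data Lake schema.

    # cumulative edit count per user at the time of each edit (schema assumed)
    hive -e "
      SELECT event_user_id,
             event_timestamp,
             COUNT(*) OVER (PARTITION BY event_user_id
                            ORDER BY event_timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - 1
               AS user_edit_count_before
      FROM wmf.mediawiki_history
      WHERE event_entity = 'revision' AND wiki_db = 'enwiki'
      LIMIT 10;
    "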