[08:06:09] morning joal!
[08:12:35] addshore: he is on vacation :)
[08:12:41] ahh! :D
[08:27:17] (PS1) Addshore: Add script counting global betafeature usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298926 (https://phabricator.wikimedia.org/T140229)
[08:38:15] (PS1) Addshore: Add script counting global betafeature usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298927 (https://phabricator.wikimedia.org/T140229)
[08:38:23] (CR) Addshore: [C: 2] Add script counting global betafeature usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298927 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[08:38:34] (CR) Addshore: [C: 2] Add script counting global betafeature usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298926 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[08:38:49] (Merged) jenkins-bot: Add script counting global betafeature usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298927 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[08:38:59] (Merged) jenkins-bot: Add script counting global betafeature usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298926 (https://phabricator.wikimedia.org/T140229) (owner: Addshore)
[09:52:47] Analytics-Tech-community-metrics, Developer-Relations (Jul-Sep-2016): Create basic/high-level Kibana (dashboard) documentation - https://phabricator.wikimedia.org/T132323#2461414 (Aklapper) a:Dicortazar>Aklapper Stealing this task from @Dicortazar as it's a good exercise for everybody - me trying...
[09:53:23] Analytics-Tech-community-metrics, Developer-Relations (Jul-Sep-2016): Play with Bitergia's Kabana UI (which might potential replace our current UI on korma.wmflabs.org) - https://phabricator.wikimedia.org/T127078#2461420 (Aklapper) stalled>Open
[10:16:18] (PS1) Addshore: Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298937
[10:17:01] (PS1) Addshore: Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298938
[10:17:09] (PS2) Addshore: Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298938
[10:17:37] (CR) Addshore: [C: 2] Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298937 (owner: Addshore)
[10:17:57] (Merged) jenkins-bot: Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298937 (owner: Addshore)
[10:23:41] (PS1) Addshore: Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298942
[10:24:02] (PS2) Addshore: Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298942
[10:24:28] (PS1) Addshore: Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298943
[10:34:22] (CR) Addshore: [C: 2] Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298942 (owner: Addshore)
[10:34:27] (CR) Addshore: [C: 2] Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298943 (owner: Addshore)
[10:34:43] (Merged) jenkins-bot: Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298942 (owner: Addshore)
[10:34:54] (Merged) jenkins-bot: Add dbname to beta feat count query [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298943 (owner: Addshore)
[12:18:26] hi milimetric, I'll be here in 15 mins, need to buy sth
[12:22:19] oh man, Nemo_bis that's crazy! Thanks for spotting it, I'll see if there's anything I can do like kill whatever madness is spiking that cpu
[12:23:37] what?! there are like 100 instances of nc running...
[12:30:09] ok, fixed: https://tools.wmflabs.org/nagf/?project=analytics#h_limn1
[12:30:24] no idea what that nc crap was about, I asked in labs, anyway, moving on
[12:34:51] milimetric: o/ anything that I can help with?
[12:35:11] I should've probably ps auxfw before I killall, but too late now :)
[12:36:07] the load was apparently 500:
[12:36:07] https://graphite.wmflabs.org/render/?title=limn1+Load+last+day&width=800&height=250&from=-1day&hideLegend=false&uniqueLegend=true&target=alias%28color%28stacked%28analytics.limn1.loadavg.01%29%2C%22%23bbbbbb%22%29%2C%221-min%22%29&target=alias%28color%28analytics.limn1.loadavg.processes_running%2C%22%232030f4%22%29%2C%22Procs%22%29
[12:36:46] there were just a ton of nc processes each taking up 2% of the CPU
[12:37:08] the only nagging thought is that it's some sort of attack since nc is like a swiss army knife
[12:38:58] what host?
[12:39:32] I am not familiar with limn in labs.. :/
[12:42:48] milimetric, back
[12:42:54] milimetric, thanks for looking into that in my ops week
[12:43:19] attack to limn1?
[12:47:57] heh, I doubt it, sorry elukey it's limn1 in the analytics project, but don't worry, that host is 100% unsupported by anyone, long long story
[12:49:13] mforns: I'll be back in like 20 minutes, but I haven't run the new algorithm yet, so I'll probably be doing that for a while
[12:51:20] milimetric, ok, np I will continue with EL schemas
[12:51:31] milimetric: don't have access to the host.. weird
[12:51:48] I am curious about nc, there must be something in the logs..
[13:15:01] elukey: oh sorry, so to access that host you need root login, if you're curious we can batcave and you can remote control me
[13:15:19] I think root login is given on a per-person basis, but you can try ssh root@limn1.eqiad.wmflabs
[13:21:14] milimetric: no no don't worry, just wanted to check logs and see if anybody tried some sort of injection
[13:21:31] ooh! how would one check that?
[13:21:40] I'd be interested to know, learn something
[13:22:59] milimetric: I don't recall what limn is used for, but if it accepts any kind of HTTP request I'd check those logs to see if nc was passed as argument
[13:23:14] let's say that you have an exec or something similar in the code
[13:25:07] aha, yeah it's apache virtual directories pointing to node servers, but there's also a defunct un-updated piwik instance on there and lyra, yeah, it's not worth it, I will check the logs if I have some downtime later, working on that algorithm now
[13:30:37] also nc might be a good strategy to open ports and enter in the system
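(A hedged illustration of the log check being discussed above: a minimal shell sketch for spotting injected nc invocations in a web server's access logs. The log paths and grep patterns are assumptions for illustration, not the actual limn1 layout.)

    # Search the Apache access logs (path assumed) for query strings that smuggle in nc/netcat.
    sudo zgrep -Ei 'nc(%20|\+| )+-[le]|netcat|/bin/(nc|sh|bash)' /var/log/apache2/*access*log* | less

    # See whether anything unexpected is still listening on a port.
    sudo netstat -tulpn | grep -vE ':(22|80|443) '

    # Show the full process tree so stray nc children can be traced back to their parent.
    ps auxfw | grep -E '[n]c '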
[13:32:02] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2461880 (elukey) We are now in pending verification of fix, let's see if oozie will not complain during the next couple of days. It...
[13:39:58] thanks milimetric
[13:48:33] ottomata: can we add a user to analytics1027 called refinery-engineer?
[13:49:03] "refinerer" doesn't make sense
[13:49:48] refinery-worker
[13:50:39] refinery is good as well but it is boring
[13:51:05] haha
[13:51:08] elukey: for scap?
[13:52:29] yes :)
[13:52:37] it will own also all the files
[13:52:49] like eventlogging does on eventlog1001
[13:52:52] rather than root
[13:54:56] elukey: why not deploy-analytics
[13:54:56] ?
[13:55:00] then we can reuse it for other stuff
[13:55:22] buuuuuuuu
[13:55:34] haha?
[13:55:41] ok all right that makes sense but more boring
[13:55:42] :P
[13:56:19] :)
[13:57:17] Analytics, Revision-Slider, TCB-Team, WMDE-Analytics-Engineering, and 4 others: Data need: Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861#2461915 (WMDE-Fisch)
[13:57:19] ottomata: would it make sense to add it to the refinery role? I didn't find any module that would need to own it
[13:57:30] group + user
[13:59:44] hm
[14:00:35] elukey: role::analytics_cluster::users
[14:00:35] ?
[14:01:57] * elukey will discover all the analytics roles sooner or later
[14:02:28] it seems a bit out of the scope no?
[14:03:02] maybe I can check who uses role::analytics_cluster::users
[14:03:45] ah analytics100[12]
[14:04:07] maybe we can create a new role::analytics_deploy::users ?
[14:04:27] hmmm
[14:04:32] i thought it was in more places, but you are right
[14:05:03] elukey: i guess refinery role is fine for now
[14:05:12] put a comment on it saying it should find a more generic place to live when we need it to
[14:05:24] super, maybe I can add a comment about this discussion
[14:05:28] k
[14:05:50] i mean, we could put it in role::analytics_cluster::client
[14:06:19] dunno
[14:06:23] refinery role i guess is good for now
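(A small sanity-check sketch for after the puppet change lands, assuming the user ends up being called deploy-analytics and the refinery lives under /srv/deployment/analytics/refinery; both the name and the path are guesses from the discussion above, not the final change.)

    # Confirm puppet created the user and group on the target host.
    id deploy-analytics
    getent group deploy-analytics

    # Check current ownership of the deployed tree (path assumed).
    ls -ld /srv/deployment/analytics/refinery

    # If it is still root-owned, hand it over to the deploy user,
    # mirroring how the eventlogging user owns its files on eventlog1001.
    sudo chown -R deploy-analytics:deploy-analytics /srv/deployment/analytics/refinery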
[14:29:55] elukey: thanks for the merge, sir!
[14:31:42] urandom: thanks for the sir! :)
[14:32:10] heh
[14:32:38] elukey: i owe you so many beers (or s/beer/$beverage_of_choice/)
[14:33:08] now all i need is an opportunity to pay up! :)
[14:33:48] mforns: it didn't work at all, it's much worse than before, now a bunch of things don't match up and the histories get truncated
[14:33:59] I don't get it... I'm going to take some time off and regroup
[14:34:01] milimetric, oh..... pfff
[14:34:13] then maybe we can brainbounce before standup?
[14:34:19] sure milimetric
[14:34:33] * urandom googles brainbounce
[14:35:06] "did you mean brain balance?"
[14:35:10] it's like a more advanced version of rubber ducking
[14:35:20] * urandom googles rubber ducking
[14:35:31] in my case the rubber duck is usually smarter
[14:35:41] https://en.wikipedia.org/wiki/Rubber_duck_debugging
[14:35:48] yep :)
[14:36:31] wow.
[14:36:38] also, for unrelated reasons it always makes me think of this: http://media.mnn.com/assets/images/2015/08/Friendly-Floatees.jpg.838x0_q80.jpg
[14:36:48] http://www.mnn.com/earth-matters/wilderness-resources/stories/what-can-28000-rubber-duckies-lost-at-sea-teach-us-about
[14:36:52] how do you rubber duck debug without coming across as bat-shit insane to everyone around you?
[14:36:59] urandom: you owe me beers? I think me and joal owe you tons more :)
[14:37:10] oh, we only hire insane people, so that's not an issue
[14:37:19] touche
[14:38:05] * elukey is mildly offended by milimetric's statement :P
[14:39:11] cool rubber duck story :]
[14:55:51] Analytics, Analytics-Cluster, Analytics-Kanban, Deployment-Systems, and 3 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2462267 (elukey) @thcipriani thanks! I have another doubt: we'd want to use the deploy-analytics user to perform the deployment but I...
[15:00:17] milimetric: mforns, quick checkin about some changes i'm making around redirects?
[15:01:31] ottomata, sure
[15:01:40] wait for dan?
[15:02:08] naw we can just check in i think
[15:02:09] s'ok
[15:02:10] in batcave
[15:02:18] ok
[15:03:19] brt
[15:20:58] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2462364 (elukey) Seems working! After 11 UTC no more empty dt fields. ``` ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatal...
[15:22:22] no more dt:- after the varnishkafka timeout increase
[15:22:26] \o/
[15:30:58] super NICE! elukey
[15:31:08] ta-ta-channnnn
[15:31:27] yeehaw!
[15:31:41] elukey, o/////////
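(The Phabricator update above quotes a truncated Hive query; a comparable check for leftover empty dt values, run from the shell, might look like the sketch below. Table, partition and column names are assumed from the usual webrequest layout and are not the exact query from T136314.)

    # Count webrequest rows whose dt is still the '-' placeholder for one example hour.
    hive -e "
      SELECT webrequest_source, COUNT(*) AS missing_dt
      FROM wmf.webrequest
      WHERE year = 2016 AND month = 7 AND day = 14 AND hour = 12
        AND dt = '-'
      GROUP BY webrequest_source;
    "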
[16:02:07] I'm listening to other q. Review
[16:55:03] people logging off, talk with you tomorrow! o/
[16:55:40] nite elukey
[16:56:12] elukey, bye!
[17:28:07] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2462987 (Ottomata) There are a lot of ongoing discussions about proper schemas. Holding on the work for this while we disc...
[17:35:12] bye!
[17:52:02] haha, guess we should productionize pivot :p
[17:55:55] ottomata: yes :P and put behind ldap :P
[18:14:29] madhuvishy: right right, it has to be behind LDAP so pms can use it
[19:43:56] (PS1) Nuria: Correcting node install instructions on AQS README [analytics/aqs] - https://gerrit.wikimedia.org/r/299025
[19:53:09] (CR) Milimetric: "small typo - you should self merge after that" (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/299025 (owner: Nuria)
[19:55:03] oh nuria_ you want to restart RMs wtih me?
[19:55:27] ottomatayessir:
[19:55:33] ottomata: yessir, sorry
[19:55:45] ok!
[19:55:47] ottomata: from 1002?
[19:56:57] ja!
[19:57:03] you can just restart that one
[19:57:03] now
[19:57:10] sudo service hadoop-yarn-resourcemanager restart
[19:57:15] don't forget to !log it here and in #ops
[19:57:50] (PS2) Nuria: Correcting node install instructions on AQS README [analytics/aqs] - https://gerrit.wikimedia.org/r/299025 (https://phabricator.wikimedia.org/T136016)
[19:58:38] ottomata: for the log change did we run puppet
[19:58:47] ottomata: or is it run on its own
[19:58:47] ?
[19:58:50] it is run on its own
[19:58:53] you can check to see
[19:58:55] ottomata: ah ok
[19:59:13] in /etc/hadoop/conf/mapred-site.xml
[19:59:15] to see if you change is in there
[19:59:18] or was it yarn-site?
[19:59:19] heh
[19:59:22] one of those
[19:59:36] grep -C 5 aggregation /etc/hadoop/conf/*.xml
[19:59:36] heh
[19:59:48] yarn site
[19:59:53] yarn.log-aggregation.retain-check-interval-seconds
[19:59:54] 86400
[19:59:55] looks good
[20:00:47] ottomata: yarn-site
[20:01:11] ottomata: ah sorry, delay
[20:01:35] aye
[20:01:37] ottomata: we will also need to restart jobs in hue right?
[20:01:40] nope
[20:01:45] nuria_: so 1002 is inactive
[20:01:50] so that is safe to do righ tnow, won't affect anything
[20:02:04] when we restart 1001, it active RM will switch to 1002
[20:02:09] ottomata: ah , i do not have sudo
[20:02:11] then we restart 1002 again to get it to switch back
[20:02:14] ohhh ja
[20:02:14] haha
[20:02:17] you can sudo to hdfs
[20:02:20] but not sudo for serices
[20:02:21] aye
[20:02:21] oook
[20:02:24] am doing then
[20:02:49] ottomata: k
[20:02:53] !log restarting hadoop-yarn-resourcemanager on analytics1002 and then analytics1001 to apply yarn log aggregation change
[20:03:19] not sure if the best place but added that cmd here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#ResourceManager
[20:03:57] nuria_: if you need to, add it up in the High Availbility section where it talks about ResourceManager
[20:04:22] k looks good on 1002
[20:04:24] doing on 1001
[20:04:32] after we do on 1001, we should watch oozie jobs in hue and make sure they are good
[20:04:34] they should be though
[20:04:55] quickly restarting (and auto failing over RM) is i think even less dangerouse than restarting all the nodemanagers
[20:04:59] since we don't lose any app master this way
[20:05:05] ottomata: done
[20:05:55] ottomata: corrected docs, looking at oozie
[20:06:37] nuria_: looks perfect
[20:06:42] well, from RM view
[20:06:46] 1002 is now master
[20:06:53] i'm going to now restart 1002 one more time to put it back on 1001
[20:08:22] ottomata: k
[20:08:29] done
[20:08:47] ok back to normal state, both RMs have been restarted
[20:08:47] ottomata: and ...
[20:08:50] now we wait another day...
[20:08:51] :)
[20:08:58] how do we know 1002 and 1001 run resource manager?
[20:09:04] oozie looks fine
[20:09:04] ottomata: that is on puppet?
[20:09:04] https://hue.wikimedia.org/oozie/list_oozie_workflows/
[20:09:07] nuria_: yes
[20:09:16] they are the hadoop master nodes, they run both RM and NameNodes
[20:09:23] ottomata: ajam
[20:09:25] 1001 is default master, so we try to keep it always in charge
[20:09:30] 1002 is standby failover
[20:09:32] so
[20:09:40] in puppet you see that 1001 is ::master and 1002 is :;standby
[20:09:54] ottomata: aham
[20:09:57] https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L36
[20:11:19] ottomata: k, let's scheck tomorrow
[20:11:21] *check
[20:11:54] yup
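(For reference, the rolling ResourceManager restart walked through above boils down to something like this sketch. The hosts are the ones named in channel; the rm id passed to yarn rmadmin is an assumption and has to match yarn.resourcemanager.ha.rm-ids in yarn-site.xml.)

    # 1. Check that the log-aggregation setting actually reached the config on disk.
    grep -C 5 aggregation /etc/hadoop/conf/yarn-site.xml

    # 2. On analytics1002 (standby RM): restart it first, nothing fails over.
    sudo service hadoop-yarn-resourcemanager restart

    # 3. On analytics1001 (active RM): restart it; the active role moves to analytics1002.
    sudo service hadoop-yarn-resourcemanager restart

    # 4. On analytics1002 again: restart once more so analytics1001 takes the active role back.
    sudo service hadoop-yarn-resourcemanager restart

    # 5. Verify which RM is active (rm id is hypothetical; check yarn-site.xml for the real ids).
    RM_ID=analytics1001-eqiad-wmnet
    yarn rmadmin -getServiceState "$RM_ID"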
[20:15:54] (PS3) Nuria: Correcting node install instructions on AQS README [analytics/aqs] - https://gerrit.wikimedia.org/r/299025 (https://phabricator.wikimedia.org/T136016)
[20:16:45] (PS4) Nuria: Correcting node install instructions on AQS README [analytics/aqs] - https://gerrit.wikimedia.org/r/299025 (https://phabricator.wikimedia.org/T136016)
[20:17:25] (CR) Nuria: [C: 2 V: 2] "Self merging docs." [analytics/aqs] - https://gerrit.wikimedia.org/r/299025 (https://phabricator.wikimedia.org/T136016) (owner: Nuria)
[20:43:32] Analytics-Cluster, Analytics-Kanban, EventBus, Operations, Services: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2463951 (Ottomata) a:Ottomata
[20:52:00] Analytics, DBA, Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#2463961 (mforns) @jcrespo I did some additions to the white list in the gerrit change set. But I still need the confirmation of 1 schema owner. Please do not activate the purging un...
[21:33:28] Analytics, DBA, Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#2464200 (mforns) @jcrespo Ok, it's confirmed. We can proceed :]
[21:37:07] bye team! see you tomorrow
[22:26:35] Analytics, Research-and-Data-Backlog: Improve bot identification at scale - https://phabricator.wikimedia.org/T138207#2464478 (DarTar)