[05:52:41] Analytics-Kanban, Wikimedia-Stream: Port RCStream clients to EventStreams - https://phabricator.wikimedia.org/T156919#3052314 (Xqt)
[07:08:22] Analytics-Tech-community-metrics: Deployment of IRC panel - https://phabricator.wikimedia.org/T138004#3052323 (Lcanasdiaz) Our developers are working on this. I'm going to check whether they fixed for the release published yesterday. If not, we'll have to wait until next one (done every Thursday) {icon frown-o}
[08:37:13] Analytics-Kanban, User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3052405 (elukey) Started https://wikitech.wikimedia.org/wiki/Incident_documentation/20170223-AQS
[08:38:14] I created https://wikitech.wikimedia.org/wiki/Incident_documentation/20170223-AQS to collect info about the outage
[10:03:26] (CR) Joal: "Comments inline." (3 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312) (owner: Fdans)
[10:55:19] Analytics-Kanban: Hive code to count global unique devices per top domain (like *.wikipedia.org) - https://phabricator.wikimedia.org/T143928#3052659 (JAllemandou) >>! In T143928#3041187, @Nuria wrote: > I think we probably need to take a second look at this calculation, compare the wikidata numbers with the...
[11:15:49] Analytics-Kanban, Operations, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3052693 (elukey) >>! In T154558#3043971, @JoeWalsh wrote: > @Milimetric this UA is from the iOS app. In testing locally, I didn't see...
[11:31:05] Analytics-Kanban, Operations, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3052728 (elukey) More numbers about number of requests landing to piwik/apache/bohrium and failed ones (503s). The following numbers...
[12:01:12] (PS4) Fdans: Add secondary table endpoint to populate Cassandra with correct timestamps [analytics/aqs] - https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312)
[12:03:52] (PS8) Joal: Port standard metrics to reconstructed history [analytics/refinery] - https://gerrit.wikimedia.org/r/322103 (owner: Milimetric)
[12:10:02] * elukey lunch!
[12:20:02] (CR) Joal: [C: 1] "LGTM ! Waiting for @milimetric approval :)" [analytics/aqs] - https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312) (owner: Fdans)
[12:23:07] (CR) Milimetric: [V: 2 C: 2] "nice" [analytics/aqs] - https://gerrit.wikimedia.org/r/338898 (https://phabricator.wikimedia.org/T156312) (owner: Fdans)
[12:23:28] I'll leave it to yall to submit and deploy
[12:23:53] milimetric: awesome :)
[12:24:07] milimetric, fdans : Not on friday ( or elukey will be after us)
[12:24:42] joal: but merging ok? or do we leave that for monday as well?
[12:24:47] thank you milimetric
[12:24:52] fdans: I don't mind merging :)
[12:28:54] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:29:44] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[12:31:44] * elukey silently observe
[12:32:00] * elukey should not say anything after an outage of 1 hour
[12:32:14] * elukey cries in a corner
[12:36:27] * mforns brings chocolate to elukey
[12:36:30] * joal hugs elukey
[12:36:56] * fdans engages in a group hug
[12:37:56] I'm so having pizza for lunch
[12:41:37] ahahah I love my team
[12:43:55] milimetric: would you give me some brain power on metrics ?
[12:44:12] meaning standard metrics, and how we plan to serve them
[12:44:45] we haven't figured that out at all, joal, did you want to brainstorm for a bit?
[12:45:01] yessir, if you have time
[12:45:03] k, give me a sec to get pants on
[12:45:07] huhu :)
[12:45:16] it's early for you, do you prefer later?
[12:46:48] joal: in da cave
[13:08:19] (PS8) Mforns: Add spark job to aggregate historical projectviews [analytics/refinery/source] - https://gerrit.wikimedia.org/r/337593 (https://phabricator.wikimedia.org/T156388)
[13:33:34] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL: 62.71% of data above the critical threshold [3686.4]
[13:35:34] RECOVERY - HDFS active Namenode JVM Heap usage on analytics1001 is OK: OK: Less than 60.00% above the threshold [3276.8]
[13:40:34] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL: 62.71% of data above the critical threshold [3686.4]
[13:47:08] taking a break a-team, later
[13:47:15] o/
[13:47:36] what are those alarms?? Active namenode?? :O :O
[13:47:47] ahhh heap usage! \o/
[14:04:34] RECOVERY - HDFS active Namenode JVM Heap usage on analytics1001 is OK: OK: Less than 60.00% above the threshold [3276.8]
[14:10:09] (PS2) Mforns: Add script to generate WSC abbrevs to domain map [analytics/refinery] - https://gerrit.wikimedia.org/r/338786 (https://phabricator.wikimedia.org/T158330)
[14:16:43] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?from=now-3h&to=now
[14:16:48] it works!
[14:17:03] effectively the namenode is using a lot of its heap
[14:29:34] Analytics-Kanban, Operations, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3053061 (Milimetric) It seems to me you can close this task and open up a new one to investigate Varnish / Apache problems (as those a...
[15:09:15] joal: I fixed some graphs in https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?from=now-30d&to=now
[15:10:08] the namenode seems to reach the top once in a while
[15:10:13] that is not a big deal of course
[15:10:40] it triggers an old gen run that fixes the issue
[15:10:49] buuuuut we might think about increasing the Xmx
[15:16:00] (PS2) Mforns: Add oozie workflow to load projectcounts to AQS [analytics/refinery] - https://gerrit.wikimedia.org/r/339421 (https://phabricator.wikimedia.org/T156388)
[15:20:36] Analytics, Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3053148 (leila) >>! In T131889#3051159, @Rafaesrey wrote: > > i) would reporting back only the scores not be enough to > codify the totals for countries with less than 100k? By this I mean...
[15:24:07] ok tuned the alarms to 1) alert only analytics 2) use higher thresholds (WARNING 90% of the heap used, CRITICAL 95%)
[15:35:18] Analytics, Research-and-Data: geowiki data for Global Innovation Index - https://phabricator.wikimedia.org/T131889#3053156 (Rafaesrey) Leila, Thanks for this reply. Let's work on the initial set of ~71 and perhaps then explore if the <100k threshold can be lowered slightly to achieve an increase in cove...
[15:41:42] milimetric mforns fdans https://www.dropbox.com/s/m4oku8x5ta43sv6/Screenshot%202017-02-24%2010.41.18.png?dl=0
[15:41:45] thoughts?
[15:41:54] ashgrigas, looking
[15:42:38] looking!
[15:44:17] just look at topic selector
[15:46:17] ashgrigas, I really like the visuals of it
[15:47:26] ashgrigas, can I quickly draw the idea I had and send it to you? I think it could match with the concept you propose, but it's difficult to express in words
[15:47:37] do you think the minimized state is ok or would you want it to collapse completely?
[15:47:51] yes please send mforns
[15:48:30] ok, will take a while, because we have a meeting now, but I send it in the next couple hours
[15:48:35] ashgrigas, ^
[15:48:53] ok
[15:48:59] sounds good
[15:53:20] ashgrigas: looks great to me, I personally think the navigation will be really nice this way
[15:53:49] enough ways to get everywhere no matter what your background
[15:55:09] milimetric: we found the issue with piwik's 503s \o/
[15:55:17] :) woah
[15:55:19] what was it
[15:55:53] a timeout in varnish that apparently seems to close TCP connections idling for 5 seconds, and causing socket read() errors (EOF)
[15:56:01] completely unrelated from piwik
[15:57:14] Note to self: Never ever talk to elukey about problems you don't wanna see solved
[15:57:17] oh interesting, glad it turned out to be generally useful
[15:58:55] good call joal. So, elukey, the next thing bothering me is this whole "we can't travel faster than the speed of light" thing. Thoughts?
[15:59:40] :]
[16:08:20] urandom: o/ - this is really suspicious https://phabricator.wikimedia.org/P4981
[16:09:18] joal: --^
[16:19:22] urandom: is it possible that the source of the problem is the alter table?
[16:19:50] urandom: this would rule out the repairs, and explain why I was seeing in some instances the old settings
[16:24:25] joal, do you have 10 mins to talk about cassandra workflow?
[16:24:36] mforns: in meeting in 5 minutes
[16:24:43] 5 minutes now ?
[16:24:51] joal, no no, let's meet after
[16:24:57] ping me joal :]
[16:25:00] k mforns :P)
[16:30:34] (PS1) Joal: Correct webrequest comments [analytics/refinery] - https://gerrit.wikimedia.org/r/339661 (https://phabricator.wikimedia.org/T157951)
[16:34:24] Analytics-Kanban, Patch-For-Review: Fix description of webrequest table - https://phabricator.wikimedia.org/T157951#3053298 (JAllemandou) a:JAllemandou
[16:46:08] elukey: umm, that log output sure does look scary; i'm not sure if it means what it says it means, though
[16:46:44] elukey: i *hope* it doesn't, otherwise, sheeesh
[16:51:17] elukey: is that a verbatim paste from the logs?
[16:51:23] elukey: whitespace and all?
[16:54:18] yes yes
[16:54:22] urandom: --^
[16:58:33] elukey: still reading the code, but i think the second form is meant to only contain the changed bits, it's what is used to create a mutation that is merged with the existing schema
[16:59:04] Analytics: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3053334 (JAllemandou)
[17:00:54] Analytics: productionize recommendation vectors - https://phabricator.wikimedia.org/T158973#3053346 (JAllemandou)
[17:01:09] Analytics: productionize recommendation vectors - https://phabricator.wikimedia.org/T158973#3053358 (JAllemandou)
[17:01:14] mforns: I'm ready !
[17:01:23] hey joal batcave?
[17:01:29] sure mforns
[17:06:53] elukey: yeah, i just tested it, that output is "normal"
[17:07:31] elukey: also, if we took that literally, then i think you'd have had more problems
[17:08:07] elukey: that would seem to indicate the tables themselves were removed by the alter, including any data that was in there
[17:08:32] you'd have had to recreate the tables, default role(s) and superuser
[17:09:07] urandom: all right thanks, back to log diving then :(
[17:12:56] elukey: here?
[17:12:59] elukey, hey do you have 5 mins?
[17:13:12] joal, elukey is mine :]
[17:13:25] No, not true, he's MY PRECIOUS !
[17:13:47] GolUM, GOlum, GOLUM !
[17:14:09] 🍻
[17:15:50] ahahah
[17:16:01] joal: sorry just read the message, I am here
[17:16:12] elukey: batcave/
[17:16:13] ?
[17:16:25] sure!
[17:44:25] Analytics-Kanban, User-Elukey: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#3053429 (elukey)
[17:44:38] milimetric: --^ will check on Monday!
[17:45:41] thanks elukey, no rush, especially since I'm not around Monday
[17:45:44] have a good weekend
[17:51:04] Analytics-Kanban, User-Elukey: Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap - https://phabricator.wikimedia.org/T157354#3053449 (elukey) Saved all the system-{a,b} logs contained in each host (/var/log/cassandra/system..) to /home/elukey/outage_logs/ fo...
[17:51:22] urandom: saved all the logs on each aqs node in /home/elukey/outage_logs/ so we will not lose them
[17:51:31] going to do a bit more log diving on monday
[17:53:30] Analytics, Analytics-Wikistats, Labs-project-Wikistats: miraheze custom domains not updated on wikistats - https://phabricator.wikimedia.org/T158976#3053450 (Reception123)
[17:53:38] Analytics, Analytics-Wikistats, Labs-project-Wikistats: miraheze custom domains not updated on wikistats - https://phabricator.wikimedia.org/T158976#3053464 (Reception123) p:Triage>Normal
[17:54:37] * elukey afk!! byeeee o/
[18:05:51] Analytics, Analytics-Wikistats, Labs-project-Wikistats: miraheze custom domains not updated on wikistats - https://phabricator.wikimedia.org/T158976#3053484 (Reception123) details also T153930
[20:20:33] bye team see you in one week
[20:20:36] :]
[21:28:58] Analytics-Kanban, Research-and-Data: Coordinate with research to vet metrics calculated from the data lake - https://phabricator.wikimedia.org/T153923#3054051 (Nuria) Research is short on resources this quarter, thus we were planning on tackling this end of quarter or early next quarter. cc @DarTar @leila