[00:39:40] 10Analytics, 10Discovery-Analysis, 10Product-Analytics, 10Reading-analysis: Productionize per-country daily & monthly active app user stats - https://phabricator.wikimedia.org/T186828#4152057 (10Tbayer) >>! In T186828#4109650, @Nuria wrote: > Our preferred path here is that the data analysts team initiates...
[04:13:39] (03PS2) 10Nuria: Enabling editing metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/428523 (https://phabricator.wikimedia.org/T192841) (owner: 10Milimetric)
[04:23:33] 10Analytics, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4152195 (10Nuria) I re-ran indexation for the 2018-02 snapshot and now segment sizes in druid are what I would expect, about 2 G per segment. sudo -u hdfs oozie job --oozie $OOZIE_URL -Dr...
[04:47:47] (03CR) 10Nuria: [V: 032 C: 032] Enabling editing metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/428523 (https://phabricator.wikimedia.org/T192841) (owner: 10Milimetric)
[05:02:09] (03PS1) 10Nuria: Wikistats Release [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/428556
[05:02:54] (03CR) 10Nuria: [V: 032 C: 032] Wikistats Release [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/428556 (owner: 10Nuria)
[05:33:46] (03PS1) 10Nuria: Release 2.2.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/428559
[05:34:58] (03CR) 10Nuria: [V: 032 C: 032] Release 2.2.5 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/428559 (owner: 10Nuria)
[05:38:56] 10Analytics, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4152235 (10Nuria) Enabled editing metrics and deployed: https://gerrit.wikimedia.org/r/#/c/428559/
[05:39:05] 10Analytics, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4152236 (10Nuria) p:05Triage>03Unbreak!
[05:39:15] 10Analytics, 10Analytics-Wikistats, 10cloud-services-team (Kanban): Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4152238 (10Nuria)
[05:39:49] 10Analytics, 10Analytics-Wikistats, 10cloud-services-team (Kanban): Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4151252 (10Nuria)
[05:40:00] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4151252 (10Nuria)
[06:05:22] 10Analytics, 10Patch-For-Review, 10User-Elukey: Report updater setting log ownership incorrectly (leading to cronspam) - https://phabricator.wikimedia.org/T191871#4152260 (10elukey) This change should always create a file with 644 perms and correct username/group after every rotation. Keeping it open for a c...
[06:22:17] morning!
[06:22:25] so I just disabled the middlemanager on druid1003
[06:22:35] so we'll be ready when we want to upgrade druid
[06:39:02] Hi elukey
[06:39:35] elukey: I'm sorry I'm gonna disrupt our plans :(
[06:40:43] elukey: family and I have an opportunity to go and sail today, and it's a rather not-so-common one, so if you don't mind too much I'm gonna take the day off
[06:41:16] I'll work late tonight when coming back, but it'll postpone our druid upgrade to tomorrow if you want me to be present
[06:45:06] joal: wow how could I mind! Have a good day Joseph!
[06:45:34] Thanks a lot mate :)
[06:45:36] tomorrow is Liberation Day in Italy so I might not be present, buuut I'll do the upgrade with Andrew this afternoon in case!
[06:45:48] don't worry and have a good sailing :)
[06:46:08] elukey: I've seen you've been working late yesterday to fix things and make us able to deploy today - and now I postpone for personal reasons - feeling a bit mwarf
[06:46:28] Ack elukey for tomorrow
[06:46:46] elukey: I'll also be connected this evening/night, I'll double check with andrew in any case
[06:47:21] joal: don't even say that! I can't imagine anything better than sailing with family (and I believe that it will be the first time for your little ones :)
[06:48:00] Indeed! Lino has gone onto a boat once, but it was with a motor, and Naé has never done anything on the sea :)
[06:48:44] :)
[06:51:07] elukey: wow didn't notice - WKS2 data is bad :(
[06:51:12] What a bad day to go missing
[06:51:45] ah I didn't notice it too :(
[06:53:19] I think I know what happened
[06:53:23] Reindexing now
[06:53:31] I'm sorry, that's me having messed up
[06:54:13] well we are in alpha/beta-ish now, it is fine in my opinion if we are still fixing up things :)
[06:54:53] !log Manually reindexing all of mediawiki-history for snapshot 2018-03 after having messed it up with job testing
[06:54:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:55:00] !log Reindexation job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033855-180330093100664-oozie-oozi-C/
[06:55:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:57:17] !log correct reindexation job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033859-180330093100664-oozie-oozi-C/
[06:57:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:01:39] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats: Data Regression - https://phabricator.wikimedia.org/T192840#4152317 (10JAllemandou) @Milimetric and @nuria: This problem is due to me having tested the new mediawiki-reduced job, without disabling the indexation part o...
[07:03:40] elukey: I'm gone - Later!
[07:06:20] o/
[08:10:42] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4152452 (10JAllemandou) @nuria's actions fixed the problem for data up to 2018-02. I restarted a job ending in 2018-03 as the problem is not related to snapshots but to...
[08:48:18] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4152565 (10elukey) Note: in netboot.cfg the kafka[12]00[123] are set with raid10-gpt-srv-ext4.cfg, which as it is formats...
[08:53:26] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4152578 (10elukey) Since we are doing this work, I'd also add the `interface::add_ip6_mapped { 'main': }` puppet config...
[10:28:39] * elukey bbr
[10:28:44] err brb
[10:42:47] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats bug: deploy causes bad merges - https://phabricator.wikimedia.org/T192890#4153014 (10Milimetric)
[10:43:02] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats bug: deploy causes bad merges - https://phabricator.wikimedia.org/T192890#4153028 (10Milimetric) p:05Triage>03High
[10:44:03] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4153036 (10Milimetric)
[10:44:08] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats: Data Regression - https://phabricator.wikimedia.org/T192840#4153038 (10Milimetric)
[10:45:02] milimetric: good morning :)
[10:45:07] morning elukey
[10:45:47] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats bug: deploy causes bad merges - https://phabricator.wikimedia.org/T192890#4153049 (10Milimetric) a:03Milimetric
[10:46:04] milimetric: whenever you are caffeinated/ready/etc.. (it is really early in NYC :) if you have time we could upgrade druid analytics to 0.10
[10:46:31] elukey: so we had that data outage yesterday
[10:46:43] not sure if that means we should stop the upgrade until it's stable
[10:47:23] upgrading wouldn't directly affect us as we work on that, because AQS hits the public cluster, but it might be nice to have the two clusters be similar so we can do any testing on the private one
[10:47:25] I was convinced that Joseph re-launched indexing
[10:47:40] ah right right
[10:47:48] nevermind then :)
[10:48:02] elukey: he relaunched it from even earlier, the denormalize job
[10:48:16] but I'm not 100% convinced it's early enough, we may need to sqoop again
[10:48:31] I don't think we know exactly what happened
[10:48:39] * elukey misread "denormalize" as "deneuralize", too many MIB movies
[10:48:41] anyway, yeah, meantime maybe we should take a break from upgrading
[10:48:44] :)
[10:48:51] I don't remember what you're talking about
[10:49:06] the deneuralizer!
[10:49:12] :D
[10:50:41] ok I am going to reimage a couple of analytics worker nodes then
[10:50:45] an1066/65
[11:26:20] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4153224 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1065.eqiad.wmnet', 'an...
[12:05:52] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4153321 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1065.eqiad.wmnet', 'analytics1066.eqiad.wmnet'] ``` and were **ALL** su...
[12:07:27] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4153323 (10JAllemandou) Job finished, data is up to date. Thanks @nuria and @milimetric for having spotted the problem and quickly fixing it!
[12:08:55] !log restart webrequest-load-wf-text-2018-4-24-9 via Hue (failed due to reimages)
[12:08:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:17:24] 10Analytics, 10Operations, 10Graphite: Restore Graphite whisper data from April 23rd - https://phabricator.wikimedia.org/T192899#4153360 (10Gilles)
[12:17:41] 10Analytics, 10Operations, 10Graphite: Restore Graphite whisper data from April 23rd - https://phabricator.wikimedia.org/T192899#4153360 (10Gilles)
[12:23:56] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4153402 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1063.eqiad.wmnet', 'an...
[12:24:00] reimaging 106[34] now
[12:24:07] some jobs might fail
[12:24:15] will restart them in case
[12:52:32] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4153463 (10Milimetric) 05Open>03Resolved Indeed, confirmed all looks good, I'll put this in code review so we can remember to talk about what happened.
[12:53:05] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4153465 (10Milimetric) 05Resolved>03Open oops, closed by accident.
[12:54:44] PROBLEM - Hadoop DataNode on analytics1063 is CRITICAL: NRPE: Command check_hadoop-hdfs-datanode not defined
[12:55:07] expired downtime
[12:57:37] elukey: thanks for restarting that webrequest workflow, do you know how come hue doesn't understand those and can't display them? Is there some way to relaunch such that we can see them here: https://hue.wikimedia.org/oozie/list_oozie_workflow/0034113-180330093100664-oozie-oozi-W/?coordinator_job_id=0002978-180213194340619-oozie-oozi-C&bundle_job_id=0002976-180213194340619-oozie-oozi-B
[13:01:47] RECOVERY - Hadoop DataNode on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[13:01:53] milimetric: it has been like that since we did the last cdh upgrade, we have never investigated :(
[13:01:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4153487 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1064.eqiad.wmnet', 'analytics1063.eqiad.wmnet'] ``` and were **ALL** su...
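For context on the workflow restart discussed above: the exact command isn't captured in this log (the oozie invocation quoted at 04:23 is truncated, and elukey used Hue), but a CLI rerun of a failed workflow typically looks like the sketch below. The job id is taken from milimetric's Hue link; the failnodes property is standard Oozie, though whether this matches what was actually run is an assumption.

    # hypothetical CLI rerun of only the failed actions of a workflow
    # (job id taken from the Hue URL above)
    sudo -u hdfs oozie job --oozie $OOZIE_URL \
      -Doozie.wf.rerun.failnodes=true \
      -rerun 0034113-180330093100664-oozie-oozi-W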
[13:08:27] I finally managed to find a way to add JMX to the journal nodes
[13:08:29] yesssss
[13:08:34] * elukey opens a task
[13:13:08] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Add JMX and Prometheus monitoring to Hadoop Journal nodes' JVMs - https://phabricator.wikimedia.org/T192905#4153521 (10elukey)
[13:15:01] (03CR) 10Fdans: [V: 032 C: 032] Add all changes to repo since Feb 2016 [analytics/ua-parser/uap-core] - 10https://gerrit.wikimedia.org/r/427415 (https://phabricator.wikimedia.org/T192465) (owner: 10Fdans)
[13:37:52] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add JMX and Prometheus monitoring to Hadoop Journal node JVMs - https://phabricator.wikimedia.org/T192905#4153603 (10elukey)
[13:38:16] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add JMX and Prometheus monitoring to Hadoop Journal nodes - https://phabricator.wikimedia.org/T192905#4153507 (10elukey)
[13:44:32] o/
[13:45:08] o/
[13:48:52] 10Analytics, 10Operations, 10Graphite: Restore Graphite whisper data from April 23rd - https://phabricator.wikimedia.org/T192899#4153360 (10fgiunchedi) We're not backing up graphite's data, though metrics are mirrored to codfw too, so we can copy back from there. Which files do you need?
[13:49:18] 10Analytics, 10Operations, 10Graphite: Restore Graphite whisper data from April 23rd - https://phabricator.wikimedia.org/T192899#4153650 (10Gilles) I nuked codfw as well...
[13:49:37] 10Analytics, 10Operations, 10Graphite: Restore Graphite whisper data from April 23rd - https://phabricator.wikimedia.org/T192899#4153651 (10Gilles) 05Open>03Invalid
[13:53:53] 10Analytics-Kanban, 10Patch-For-Review: Checklist for geowiki pipeline - https://phabricator.wikimedia.org/T190409#4072764 (10Milimetric) p:05Triage>03Normal
[14:07:23] * elukey afk for a bit!
[14:30:37] (03CR) 10Ottomata: [V: 032 C: 032] Point uap-core submodule to latest commit [analytics/ua-parser] - 10https://gerrit.wikimedia.org/r/427620 (https://phabricator.wikimedia.org/T192464) (owner: 10Fdans)
[14:30:48] fdans: oh ya, had an idea about the maxmind stuff if you wanna chat
[14:31:06] ottomata: cave?
[14:31:13] i was just applying your suggestions
[14:31:34] real quick: i wonder, if you do the hardlink thing
[14:31:40] you can avoid even checking if the files have changed
[14:31:43] not sure if that is what we want
[14:31:50] it will mean extra copies in HDFS, but there we probably don't care
[14:32:05] locally, you can just make a new timestamp (weekly?) every time the cron runs
[14:32:08] and cp -rl
[14:33:17] also, if possible, it is probably better to take the source, archive, etc. dirs in as CLI arguments
[14:33:20] and pass them via the cron
[14:33:26] ottomata: I'd merge the two journalnodes jmx+etc.. patches if you are ok
[14:33:31] that way you can test the script without affecting the actual archives
[14:33:39] elukey: proceed!
[14:33:42] ack!
[14:34:29] fdans: does that make sense?
[14:34:40] hmmmm I'm thinking
[14:35:43] ottomata: if we don't check if the files have changed we're generating a new directory every day with repeat data, no?
[14:36:14] well, new dir yes, but maybe that is better?
[14:36:17] if we run daily maybe its too much
[14:36:22] we wouldn't be creating new files
[14:36:33] but that way you could always link against a particular time
[14:36:38] even if files haven't changed
[14:36:51] instead of always looking up the closest timestamp for the timeframe you are interested in
[14:37:44] elukey: i merged ya
[14:37:57] puppet-merged*
[14:38:01] ah thanks!
[14:38:22] ottomata: ok, but that would mean, for consistency, creating a directory for each snapshot every period we determine as granularity
[14:38:24] since 2014
[14:38:29] hm
[14:38:49] i suppose yah. hm, or, i dunno, i don't mind so much if in the past it just has the different changes
[14:38:51] if we create it daily it's a whole lotta repeated data
[14:38:52] i suppose we could do that ya
[14:38:57] daily is a lot
[14:39:09] but if we do it like, weekly we run the risk of omitting updates
[14:39:11] it isn't repeated data on the host tho, just lots of dirs
[14:39:40] fdans: the puppet cron only runs weekly
[14:39:54] so, we only attempt to download new files from maxmind weekly
[14:39:58] oh, I thought we were running it every day
[14:40:05] weekday => 0,
[14:40:05] hour => 3,
[14:40:05] minute => 30,
[14:40:11] 3:30 am on Sunday (i think)
[14:40:31] i dunno, maybe its not better
[14:40:35] lets ask in P.S.?
[14:40:42] it is easier to manage and work with
[14:40:47] but you are right, it is a lot more dirs
[14:41:21] !log restart hadoop hdfs journalnode on analytics1028 to pick up jmx settings
[14:41:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:41:56] ah! journalnode jmx metrics \o/
[14:42:07] yeehaw
[15:00:33] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4153901 (10Nuria) a:05JAllemandou>03None
[15:01:48] ping fdans
[15:01:52] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Change 'NaN' & 'Infinite' to something more helpful in metrics % change over the selected time range - https://phabricator.wikimedia.org/T192028#4124963 (10Nuria)
[15:37:12] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4154102 (10jmatazzoni) I see the Editor data now. Thanks. But when I split by editor type, Anonymous users come in as zero. See screenshot. So it looks like anon edit...
[16:02:12] ottomata: adding prometheus jmx metrics for the journal nodes!
[16:02:31] will do one at a time
[16:03:05] 10Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog, and 2 others: Enable Reading List Syncing usage stats - https://phabricator.wikimedia.org/T191859#4154304 (10mpopov) >>! In T191859#4152718, @Tgr wrote: > All these new proposals sound a bit overcom...
[16:03:40] 10Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog, and 2 others: Enable Reading List Syncing usage stats - https://phabricator.wikimedia.org/T191859#4154308 (10Nuria) >All these new proposals sound a bit overcomplicated. Why not just use X-Analytic...
[16:07:48] do it!
[16:08:49] of course I made a typo in the cdh module, fixing it
[16:12:03] ok 1028 working fine!
[16:12:13] going to triple check and then proceed with the other two
[16:13:14] gr8
[16:29:14] 10Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog, and 2 others: Enable Reading List Syncing usage stats - https://phabricator.wikimedia.org/T191859#4118961 (10Ottomata) > would be much easier and simpler to do on the backend side. We're trying to...
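A minimal sketch of the hardlink idea ottomata and fdans discuss above (14:31-14:40), assuming a weekly cron and made-up paths; the real puppet cron and directory layout are not shown in this log. cp -rl creates hardlinks instead of copying data, so each timestamped snapshot is nearly free on disk. It relies on the downloader replacing database files with new inodes (as geoipupdate-style tools do) rather than overwriting them in place, so older snapshots keep pointing at the old data.

    #!/bin/bash
    # hypothetical weekly snapshot of the MaxMind databases via hardlinks;
    # source and archive dirs come in as CLI arguments, per the suggestion above
    src="${1:-/usr/share/GeoIP}"          # source dir (assumed path)
    archive="${2:-/srv/geoip/archive}"    # archive dir (assumed path)
    ts="$(date +%Y-%m-%d)"
    mkdir -p "$archive"
    # hardlink-copy: new directory entries, same inodes, no extra data on disk
    cp -rl "$src" "$archive/$ts"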
[16:30:51] !log restart hadoop hdfs journalnode on analytics1035/52 to pick up prometheus jmx settings
[16:30:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:32:00] 10Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog, and 2 others: Enable Reading List Syncing usage stats - https://phabricator.wikimedia.org/T191859#4154444 (10Ottomata) BTW, just checking that you are synced up with the iOS team on this. It sounds...
[16:42:00] elukey: i think i must be missing something with burrow
[16:42:14] i think i see the lag metrics exported, but i don't get them in grafana
[16:42:20] for main-eqiad_to_eqiad (analytics)
[16:42:47] curl http://kafkamon1001.eqiad.wmnet:9500/metrics | grep total_lag | grep main-eqiad_to_eqiad
[16:42:48] works
[16:42:50] so they are there
[16:42:52] and are being exported
[16:43:50] OH maybe it is just my grafana query
[16:43:51] hm
[16:44:00] OHH right
[16:44:01] yes yes
[16:44:02] sorry carry on
[16:45:35] ahhaha just seen the ping sorry :)
[16:57:47] yeah i forgot that i'm not graphing lag where lag is the same as it was a week ago (so we don't show stale topics and/or no longer subscribed topics)
[16:58:00] a week ago the lag was 0 (non existent!) so I don't see anything
[16:58:35] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4154581 (10Milimetric) hm, indeed https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400 shows...
[16:59:43] ottomata: one thing that I keep forgetting to ask
[17:00:13] our kafka prometheus monitoring might be a bit "noisy", like the partition sync lag
[17:00:23] because they look for the past 30m of data
[17:00:52] so after restarts, if we have some spikes, they tend to fire and not clear for a long time
[17:01:12] not a big deal but we could think about reviewing the time window
[17:09:13] yeah
[17:09:16] sounds good elukey
[17:09:24] we could expand it
[17:09:30] 30m was kinda arbitrary, just to have a period
[17:12:22] all right will send a code review tomorrow :)
[17:19:33] milimetric: did you look at anonymous data in snapshot?
[17:22:05] nuria_: I'm looking, but haven't concluded anything
[17:26:23] milimetric: is data filled in on snapshot?
[17:26:28] milimetric: probably yes, right?
[17:26:47] milimetric: i have an interview will be back in a bit
[17:26:50] nuria_: yeah, this is the confusing result to the naive query:
[17:26:53] https://www.irccloud.com/pastebin/vvLPCp85/
[17:27:05] milimetric: what query?
[17:27:14] (those are the counts of anon revisions for those two wikis for those two snapshots)
[17:27:20] so like select where rev_user = 0 basically
[17:27:29] select wiki_db, snapshot, count(*) from wmf_raw.mediawiki_revision where snapshot in ('2018-03', '2018-01') and wiki_db in ('eswiki', 'rowiki') and rev_user = 0 group by wiki_db, snapshot;
[17:27:51] it's basically exactly what you'd expect, slightly higher numbers in the more recent snapshot
[17:27:54] rev_user = 0
[17:27:58] means anonymous?
[17:28:02] yes
[17:28:05] so the error must be somewhere in denormalize
[17:28:13] milimetric: in reduce you mean
[17:28:31] i guess one of those two places
[17:28:34] milimetric: will check back in a bit
[17:28:42] np, gl at the interview
[17:37:49] a-team: added a journalnode section to https://grafana.wikimedia.org/dashboard/db/analytics-hadoop
[17:39:13] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add JMX and Prometheus monitoring to Hadoop Journal nodes - https://phabricator.wikimedia.org/T192905#4154714 (10elukey) Added a journalnode section to https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1
[17:39:25] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add JMX and Prometheus monitoring to Hadoop Journal nodes - https://phabricator.wikimedia.org/T192905#4154715 (10elukey)
[17:39:41] done!
[17:40:33] all right, done for the day :)
[17:40:39] talk with you tomorrow o/
[17:40:48] let me know if we can or not proceed with the Druid upgrade
[17:40:57] * elukey off!
[17:41:28] elukey: there are still data issues, if it doesn't hurt anything we should probably wait
[17:41:44] have a good night
[17:47:29] 10Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog, and 2 others: Enable Reading List Syncing usage stats - https://phabricator.wikimedia.org/T191859#4154761 (10chelsyx) > BTW, just checking that you are synced up with the iOS team on this. It sound...
[17:48:55] 10Analytics, 10Collaboration-Team-Triage, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make EchoNotification job JSON-serializable - https://phabricator.wikimedia.org/T192945#4154764 (10Pchelolo) p:05Triage>03Normal
[17:50:40] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (blocked): Make gwtoolsetUploadMediafileJob JSON-serializable - https://phabricator.wikimedia.org/T192946#4154783 (10Pchelolo) p:05Triage>03Normal
[17:51:04] 10Analytics, 10Collaboration-Team-Triage, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make EchoNotification job JSON-serializable - https://phabricator.wikimedia.org/T192945#4154796 (10Pchelolo)
[18:10:32] 10Analytics, 10Maps-Sprint, 10Operations, 10RESTBase, and 3 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10Gehel)
[18:27:36] milimetric: back
[18:27:43] milimetric: want to talk in cave?
[18:29:24] nuria_: yeah, one sec I'll join
[18:30:24] 10Analytics, 10Commons, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Make gwtoolsetUploadMediafileJob JSON-serializable - https://phabricator.wikimedia.org/T192946#4154925 (10Pchelolo)
[18:40:55] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4154938 (10Nuria) Reindexing again 2018-02 data, looking into 2018-03 issue with anonymous editors oozie job -info 0034476-180330093100664-oozie-oozi-W
[19:24:23] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Replacement of stat1002 and stat1003 - https://phabricator.wikimedia.org/T152712#4155089 (10Cmjohnson)
[19:32:17] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&v
[19:32:17] n-eqiad_to_jumbo-eqiad
[19:32:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name
[19:32:38] bo-eqiad
[19:32:48] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-
[19:32:57] PROBLEM - Throughput of EventLogging EventError events on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[19:33:07] PROBLEM - Zookeeper node JVM Heap usage on conf1001 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen
[19:33:17] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad dropped message count in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_nam
[19:33:17] iad
[19:33:22] PROBLEM - Zookeeper node JVM Heap usage on conf1002 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen
[19:33:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&v
[19:33:28] n-eqiad_to_jumbo-eqiad
[19:33:37] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&va
[19:33:37] -codfw_to_main-eqiad
[19:33:39] PROBLEM - Zookeeper node JVM Heap usage on conf1003 is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen
[19:33:47] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mir
[19:33:47] d_to_eqiad
[19:33:49] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&va
[19:33:49] -codfw_to_main-eqiad
[19:33:57] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message produce rate in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mir
[19:33:57] d_to_eqiad
[19:34:08] RECOVERY - Zookeeper node JVM Heap usage on conf1001 is OK: (C)9.72e+08 ge (W)9.21e+08 ge 4.862e+08 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen
[19:34:10] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirr
[19:34:10] _to_jumbo-eqiad
[19:34:16] whaaaat
[19:34:28] PROBLEM - Throughput of EventLogging NavigationTiming events on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1
[19:34:34] elukey: see ops, i'm not sure if i caused this yet or not, don't think i did
[19:34:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 503.3 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[19:34:39] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 384.9 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[19:34:42] prometheus in eqiad is not responding to queries
[19:34:47] RECOVERY - Zookeeper node JVM Heap usage on conf1003 is OK: (C)9.72e+08 ge (W)9.21e+08 ge 8.213e+08 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen
[19:35:08] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: http://prometheus.svc.eqiad.wmnet/ops/api/v1/query timeout while fetching: HTTPConnectionPool(host=prometheus.svc.eqiad.wmnet, port=80): Read timed out. (read timeout=10) https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=
[19:35:08] -codfw
[19:35:16] yeah I figured something like that, my phone started ringing like crazy :D
[19:52:00] yeah
[19:52:14] elukey: you are done working for the day ya? not avail for a grafana brain bounce? :)
[19:53:30] ottomata: Marika could probably kill me if I start working now, do you mind if we do it tomorrow when you log in? :)
[19:54:00] do it tomorrow yes!
[19:54:28] ack! Thanks :)
[19:54:33] * elukey off again!
[20:10:43] quick question: how long are kafka EL topic messages kept around? (If I understand Kafka doc correctly, normally messages have a TTL, but maybe I got it wrong...)
[20:11:27] 10Analytics, 10Cassandra, 10Maps-Sprint, 10Operations, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4155263 (10mobrovac)
[20:30:17] (03PS3) 10Framawiki: Update dependencies [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/428140 (https://phabricator.wikimedia.org/T192731)
[20:32:59] AndyRussG: 7 days
[20:33:13] but, we import them all into hadoop
[20:33:15] and those are 90 days
[20:33:54] ottomata: gotcha thanks!!!
[20:34:21] ottomata: 60 days, right?
[20:34:38] no 90
[20:34:55] webrequest was the only thing that was ever 60 days, and that was years ago :)
[20:36:14] Hi team
[20:36:17] Finally back
[20:36:17] hiyaaa
[20:36:21] wow sailing so fun!
[20:36:21] hello
[20:36:23] how was it?
[20:36:33] Was awesome ottomata :)
[20:36:52] Sun like we never have in Brittany, not too much wind so great for kids
[20:37:04] By chance we didn't forget the tan-lotion :)
[20:37:30] nice
[20:37:35] big boat little boat?
[20:38:00] medium-small boat :) like 7/8m
[20:38:50] ottomata: I learnt sailing when I was a kid and then didn't keep to it much - Now that I have kids, I enjoy a lot getting back to it :)
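On the retention question answered above (7 days in Kafka, 90 days once imported into Hadoop): per-topic retention overrides can be inspected with the standard Kafka tooling, roughly as sketched below. The zookeeper address and topic name are placeholders, not taken from this log.

    # show per-topic config overrides; if no retention.ms override is set,
    # the broker default applies (log.retention.hours, e.g. 168h = 7 days)
    kafka-configs.sh --zookeeper localhost:2181 --describe \
      --entity-type topics --entity-name eventlogging_NavigationTiming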
[20:39:04] nuria_: batcave for data issues?
[20:39:08] milimetric: --^
[20:39:12] joal: yes
[20:39:50] ok next offsite in Brittany joal is taking us sailing
[20:41:23] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: Enable Reading List Syncing usage stats - https://phabricator.wikimedia.org/T191859#4155401 (10mpopov)
[20:53:20] 10Analytics: only hdfs (or authenticated user) should be able to run indexing jobs - https://phabricator.wikimedia.org/T192959#4155423 (10Nuria)
[20:55:40] ottomata: if you skipper ;) I don't feel good enough to be skipper, but I'd love that!!
[20:56:01] ottomata: anything I can help with on kafka/mirrormaker?
[20:57:24] joal nopers not right now!
[20:57:32] got the prometheus / profile stuff all applied to main clusters
[20:58:11] next gotta plan stretch / java 8 upgrade
[20:59:33] k ottomata - So the alerts were kinda expected I assume
[21:01:36] well no, they were actually caused by my grafana dashboarding!
[21:01:49] there is some bug in the way the 'all' wildcard value is selected
[21:01:58] and it ended up querying for all eqiad hosts
[21:02:01] which overloaded the prometheus server
[21:02:09] and caused regular alert queries to time out
[21:03:47] got it ottomata
[21:03:59] Thanks for the explanation
[21:16:10] so twitter has 12k active Hadoop nodes :-O (we have 50) https://twitter.com/marsanfra/status/933488292456054784
[21:26:04] haha amazing
[21:26:04] yeah
[21:29:20] 10Quarry, 10Patch-For-Review: Update dependencies - https://phabricator.wikimedia.org/T192731#4155610 (10Framawiki) For the record ```lang=shell framawiki@quarry-main-01:~$ /srv/venv/bin/pip freeze Flask==0.10.1 Jinja2==2.7.3 MarkupSafe==0.23 PyJWT==1.4.1 PyMySQL==0.6.2 PyYAML==3.11 SQLAlchemy==0.9.7 Werkzeug...
[22:32:46] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4155784 (10Nuria) Reindexing done, need to delete last month on snapshot.
[22:32:59] neilpquinn: fyi that we found an issue with our data lake 2018-03 snapshot, we are looking into it.
[22:46:10] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4155814 (10mpopov)
[22:48:28] (03PS1) 10Nuria: Fixing issue with merges [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/428850
[22:48:42] (03CR) 10Nuria: [V: 032 C: 032] Fixing issue with merges [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/428850 (owner: 10Nuria)
[22:56:59] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4155832 (10mpopov) @APalmer_WMF @Fjalapeno @Jhernandez: can y'all please take a look at the updated descripti...
[23:13:54] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4155885 (10Nuria) @mpopov FYI that adding things to X-Analytics does not work automagically, we strongly recom...
[23:36:03] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4155946 (10mpopov) >>! In T191859#4155885, @Nuria wrote: > @mpopov FYI that adding things to X-Analytics does...