[04:20:07] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_eventbus on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_eventbus
[06:19:34] <elukey>	 hello!
[06:21:07] <elukey>	 so each time in the logs on an-coord1001 I can see
[06:21:08] <elukey>	 task error: Container [pid=17784,containerID=container_e01_1553764233554_46256_01_000003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 6.1 GB of 4.2 GB virtual memory used. Killing container.
[06:21:27] <elukey>	 that makes sense, by default the map xmx limit is 1638Mb
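(Editor's note: the 1638 MB figure matches a heap sized at 80% of a 2048 MB container. The sketch below only illustrates that arithmetic. The 0.8 ratio and the helper name `heap_mb` are assumptions for illustration, not values read from the camus or YARN configuration discussed here.)

```python
# Hypothetical helper: derive a mapper JVM heap (-Xmx) from the YARN container size.
# HEAP_RATIO is an assumed heap-to-container ratio, chosen because 2048 * 0.8 ~= 1638.
HEAP_RATIO = 0.8


def heap_mb(container_mb: int, ratio: float = HEAP_RATIO) -> int:
    """Return the heap size in MB, leaving the rest of the container for non-heap memory."""
    return int(container_mb * ratio)


if __name__ == "__main__":
    # Container sizes mentioned in the conversation: 2 GB, 4 GB and 8192 MB.
    for container in (2048, 4096, 8192):
        print(f"container={container} MB -> heap={heap_mb(container)} MB")
```

If the heap is left close to the container size, off-heap usage can push the process over the physical limit, which would be consistent with the "running beyond physical memory limits" kills above.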
[06:22:29] <elukey>	 example of map failure
[06:22:30] <elukey>	 https://yarn.wikimedia.org/jobhistory/tasks/job_1553764233554_46256/m
[06:29:09] <elukey>	 mmmm and it doesn't seem a single topic
[07:00:40] <elukey>	 wow so I have raised limits in camus' conf.. less map kills but still 
[07:00:43] <elukey>	 is running beyond physical memory limits. Current usage: 4.0 GB of 4 GB
[07:07:00] <joal>	 Hi elukey 
[07:07:35] <joal>	 elukey: this feels weird
[07:08:07] <elukey>	 joal: I am trying with 8192 as limit, seems not killing this time
[07:08:11] <elukey>	 bonjour :)
[07:08:44] <joal>	 elukey: I'm wondering about the possible reasons ... huge message? too many topics?
[07:08:54] <joal>	 elukey: and also, why now?
[07:09:35] <elukey>	 I'd say huge messages, at some point the mappers have difficulties with the input data.. would it sound plausible?
[07:10:18] <joal>	 yes, but man, a message needed to move from 2G to 8G ???
[07:11:12] <elukey>	 no idea how the mapper does the job exactly, probably it reads a lot of big messages (say close to our kafka limit, 4Mb) on the haep?
[07:11:15] <icinga-wm>	 PROBLEM - Check the last execution of camus-eventbus on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-eventbus
[07:11:15] <elukey>	 *heap?
[07:11:21] <elukey>	 it would quickly fill up
[07:11:37] <elukey>	 yeah that one is due to me executing camus manually
[07:11:44] <joal>	 yeah, makes sense
[07:11:45] <elukey>	 I know icinga I am sorry
[07:12:01] <elukey>	 it is now stuck at 93% sigh
[07:12:12] <elukey>	 ah no slowly passed to 94
[07:12:21] <elukey>	 Andrew is still not back
[07:12:27] <elukey>	 hahahah
[07:12:29] <joal>	 elukey: IIRC camus is time-bound
[07:12:34] <joal>	 :)
[07:13:08] <elukey>	 how are you? All good during the weekend?
[07:14:18] <joal>	 All good thanks, I recovered over the weekend - fever was gone on sunday
[07:14:22] <joal>	 How about you?
[07:14:54] <elukey>	 good, so now we have back 4 Josephs! :D \o/
[07:15:12] <joal>	 huhu :) I'll do my best to try to fulfill the expectation ;)
[07:16:02] <elukey>	 all good! Didn't do much, but I slept a bit after the long week :D
[07:18:41] <elukey>	 camus done!
[07:18:56] <joal>	 \o/ Thanks a lot elukey 
[07:19:36] <joal>	 elukey: how much lag do we need to cover?
[07:20:17] <elukey>	 I think 2/3 of a day
[07:20:42] <joal>	 I had the feeling it was more - Great
[07:21:31] <icinga-wm>	 RECOVERY - Check the last execution of camus-eventbus on an-coord1001 is OK: OK: Status of the systemd unit camus-eventbus
[07:23:58] <elukey>	 this was me manually resetting the unit
[07:35:39] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_eventlogging_eventbus on an-coord1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_eventbus
[07:36:03] <elukey>	 ah joal I wanted to show you https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501578/
[07:36:22] <elukey>	 the idea is to create a new group called "analytics-deployers" to add people that can scap as analytics
[07:36:32] <elukey>	 atm is analytics-admins, that is not really flexibly
[07:36:35] <elukey>	 *flexible
[07:37:22] <joal>	 works for me elukey :) I know Andrew didn't want to complexify groups, but it was a while ago, and needs have changed
[07:38:23] <elukey>	 yeah I know, I'll have a chat with him today, but I think it is better than using analytics-admins (that now allows a ton of sudo perms)
[07:38:46] <elukey>	 brb!
[08:35:18] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[08:37:42] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[08:55:22] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:38] <elukey>	 sigh
[08:58:28] <elukey>	 again our dear bot
[08:58:48] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:00:56] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:00:56] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:03:41] <elukey>	 need to go out for an errand, will be back in ~1:30 probably!
[09:08:47] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:09:59] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:30:37] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:32:09] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:34:09] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:34:09] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:34:11] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:35:05] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:35:47] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:36:09] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:37:15] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:39:31] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:48:33] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:52:11] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:54:23] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:55:37] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:56:56] <wikibugs>	 (CR) Hoo man: [C: +2] Count number of Wikidata edits by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: Lucas Werkmeister (WMDE))
[09:57:17] <wikibugs>	 (Merged) jenkins-bot: Count number of Wikidata edits by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: Lucas Werkmeister (WMDE))
[09:57:20] <wikibugs>	 (Merged) jenkins-bot: Track number of links to Wikidata entity namespaces [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500966 (https://phabricator.wikimedia.org/T218903) (owner: Michael Große)
[09:59:55] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:01:08] <wikibugs>	 (PS1) Hoo man: Count number of Wikidata edits by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/502169 (https://phabricator.wikimedia.org/T218901)
[10:03:05] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:03:36] <wikibugs>	 (PS1) Hoo man: Track number of links to Wikidata entity namespaces [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/502179 (https://phabricator.wikimedia.org/T218903)
[10:04:15] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:06:17] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:14:49] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:15:59] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:18:54] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:19:48] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:23:00] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:23:54] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:34:06] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:35:47] <elukey>	 silencing the alarms
[10:40:12] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:41:26] <elukey>	 just done --^
[10:41:35] <elukey>	 quick lunch then I'll be fully available
[10:43:31] <wikibugs>	 (CR) Michael Große: [C: +1] "Not sure if there is much I can do here?" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/502179 (https://phabricator.wikimedia.org/T218903) (owner: Hoo man)
[11:19:39] <wikibugs>	 (PS1) Joal: Update mediawiki-history per-page and per-editor [analytics/aqs] - https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910)
[11:24:08] <wikibugs>	 (PS2) Joal: Update mediawiki-history per-page and per-editor [analytics/aqs] - https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910)
[11:24:25] * joal is fed up by the AQS alarms ---^
[11:28:39] <elukey>	 nuria: whenever you have time - https://phabricator.wikimedia.org/T220084
[11:29:15] <elukey>	 I think this approval is kind of borderline for us (since it is wmde users working on analytics nodes) but better triple checking :)
[11:35:50] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (JAllemandou) I did a quick analysis over request-patterns: 94% of edits-per-page requests made on April 4th were on a timespan of more than 1 year...
[12:20:12] <wikibugs>	 Analytics, Analytics-Kanban, Operations, Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (fgiunchedi) Removing mailing-lists tag since work there is done.
[12:21:53] <wikibugs>	 Analytics, Operations: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (fgiunchedi) Removing mailing-lists tag since work there is done.
[12:24:51] <wikibugs>	 Analytics, WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (jbond)
[12:27:07] <wikibugs>	 Analytics, WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (jbond) p:Triage→Normal
[12:31:41] <wikibugs>	 Analytics, WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (elukey) @jbond IIUC Adam will keep working with us as volunteer, not sure if we need to follow up with this task. What do you think?
[12:52:07] <wikibugs>	 Analytics, WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (jbond) @elukey he is remaining as a volunteer so i agree this probably doesn't need an action. however im  not familiar enough with HDFS/PPI stuff to know if  there is a difference between the WMF and N...
[13:07:54] <wikibugs>	 Analytics, WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (elukey) I am not sure either, let's check anyway :)  @awight is any of the following needed and/or containing PII data that we can remove?  ` ====== stat1004 ====== total 4 drwxrwxr-x 3 awight wikidev 4...
[13:11:26] <elukey>	 brb~
[13:43:15] <elukey>	 is Andrew working today?
[13:43:45] <joal>	 ¯\_(ツ)_/¯
[13:54:02] <elukey>	 probably not, I wanted to ask some questions about the new cdh
[13:57:51] <wikibugs>	 Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) > But you'd want to upgrade these syst...
[13:59:55] <wikibugs>	 Analytics, EventBus, Operations, serviceops, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (Ottomata)
[14:00:00] <wikibugs>	 Analytics, EventBus, Operations, serviceops, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (Ottomata) Open→Resolved
[14:00:32] <joal>	 elukey: looks like he's back --^ ;)
[14:00:49] <wikibugs>	 Analytics, Analytics-Kanban, Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (Ottomata) Until we get rid of Camus, we can't restrict old client versions.  Camus uses a 0.8 client. :/
[14:00:50] <elukey>	 he is!
[14:01:14] <joal>	 elukey: do you agree for me to test a kill-task for druid-public on an old datasource manually (other approach than the one used by the script on friday)
[14:01:40] <joal>	 elukey: Goal is to make sure this approach doesn't break brokers
[14:02:06] <elukey>	 ahh the kill task is to delete datasources
[14:02:21] <joal>	 elukey: kill-task is to deep-delete segments
[14:02:25] <elukey>	 yes yes 
[14:02:38] <joal>	 elukey: when you delete all segments of a disabled datasource, it is considered deleted
[14:02:53] <joal>	 elukey: Testing :)
[14:03:30] <elukey>	 yes I am aware of that, I wasn't getting the "kill" part :D
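(Editor's note: a minimal sketch of what a Druid "kill" task spec looks like, for readers unfamiliar with the term. Druid accepts task specs as JSON POSTed to the Overlord's `/druid/indexer/v1/task` endpoint. The datasource name and interval below are hypothetical placeholders, not the ones joal tested, and the helper only builds the payload without sending anything.)

```python
import json

# Path on the Druid Overlord that accepts task specs (host and port are site-specific).
OVERLORD_TASK_PATH = "/druid/indexer/v1/task"


def build_kill_task(datasource: str, start: str, end: str) -> dict:
    """Build a kill task spec that deep-deletes unused segments in [start, end)."""
    return {
        "type": "kill",
        "dataSource": datasource,
        "interval": f"{start}/{end}",  # ISO-8601 interval
    }


if __name__ == "__main__":
    # Hypothetical datasource and interval, for illustration only.
    task = build_kill_task("example_old_datasource", "2018-01-01", "2019-01-01")
    print(f"POST {OVERLORD_TASK_PATH}")
    print(json.dumps(task, indent=2))
```

A kill task only removes segments that are already marked unused, which is why the datasource has to be disabled first, as joal describes above.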
[14:08:06] <wikibugs>	 Analytics, EventBus, Operations, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (herron)
[14:08:30] <elukey>	 ottomata: o/
[14:08:33] <elukey>	 hiiiiii
[14:13:45] <ottomata>	 HIIII
[14:13:47] <ottomata>	 o/ o/ o/
[14:13:53] <ottomata>	 sorry man emails are crazy hello!
[14:14:09] <elukey>	 :)
[14:14:29] <elukey>	 now that you are officially back we can assume that the outages are over
[14:16:06] <ottomata>	 hahah
[14:16:07] <ottomata>	 oh man
[14:18:42] <wikibugs>	 Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (elukey) >>! In T219842#5093612, @Ottomata wrote:...
[14:19:34] <wikibugs>	 Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) +1 sounds good.
[14:20:02] <milimetric>	 hey everyone :)
[14:20:20] <milimetric>	 technically we're having office hours right now!
[14:20:36] <milimetric>	 but nobody's been showing up so we gotta market it more.  I'll have to ping johan
[14:22:22] <wikibugs>	 Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) Actually, this makes sense.  I think E...
[14:26:03] <elukey>	 ottomata: it is a serious problem that we need to start taking into account when you go away for some days :D
[14:26:19] <elukey>	 last week it was a nightmare
[14:26:32] <joal>	 looks like the 'POST kill task' put a lot less pressure on druid
[14:29:17] <wikibugs>	 (CR) Ottomata: Oozie: add article recommender (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:29:44] <wikibugs>	 (CR) Ottomata: "I think merging this as is is fine!  :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:29:49] <wikibugs>	 (CR) Ottomata: [C: +2] Oozie: add article recommender [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:29:51] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] Oozie: add article recommender [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:31:40] <elukey>	 joal: nice!
[14:32:12] <joal>	 elukey: I will refactor druid deletion accordingly as suggested in T220111
[14:32:13] <stashbot>	 T220111: Refactor druid data deletion script - https://phabricator.wikimedia.org/T220111
[14:34:38] <elukey>	 super
[14:37:27] <mforns>	 hey team
[14:37:32] <mforns>	 omg alarms...
[14:37:38] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 3 others: Modern Event Platform: Stream Intake Service: Implementation: Deployment Pipeline - https://phabricator.wikimedia.org/T211247 (akosiaris)
[14:37:40] <wikibugs>	 Analytics, EventBus, Operations, Patch-For-Review, Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (akosiaris) Open→Stalled Stalling until we have some sane solution.
[14:37:51] <mforns>	 elukey, can I help w/ sth?
[14:38:40] <wikibugs>	 (CR) Bmansurov: "Thanks for reviews and merging!" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:39:41] <elukey>	 mforns: all handled! For aqs we are waiting on a restbase deploy, and camus seems to be behaving after this morning
[14:39:55] <mforns>	 oh wow, ok
[14:40:43] <wikibugs>	 Analytics, Research, Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (bmansurov)
[14:40:57] <wikibugs>	 Analytics, Research, Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (bmansurov)
[14:40:59] <wikibugs>	 Analytics, Research, Wikidata: Copy Wikidata dumps to HDFs - https://phabricator.wikimedia.org/T209655 (bmansurov)
[14:46:01] <wikibugs>	 (CR) Ottomata: Oozie: add article recommender (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:50:23] <fdans>	 !log backfilling prefupdate schema into druid from Jan 1 2019 until Apr 1 2019
[14:50:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:51:05] <wikibugs>	 (CR) Milimetric: "Looks great, small nit on the max seconds used.  I'm not sure about the usefulness of these endpoints if we can't look at full history, bu" (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910) (owner: Joal)
[14:59:11] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/501715 (https://phabricator.wikimedia.org/T207280) (owner: HaeB)
[15:04:02] <wikibugs>	 Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (mforns)
[15:21:00] <wikibugs>	 Analytics, Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (Milimetric) The easiest thing to do is to delete the old data and change the schema going forward.  Let me know if this is ok to do, @AndyRussG....
[15:28:40] <wikibugs>	 Analytics: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (Milimetric) Declined→Open Thanks @Bawolff, I'll reopen this as it's enough of a pain for me and other workflows that it needs to be taken care of.  Feel free to unsubscribe....
[15:33:00] <wikibugs>	 Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (fdans) Data has been ingested up to Apr 1 00:00 (this means Apr 1 is not included). I'll add now the patch for puppet to load routinely.  https://tur...
[16:00:13] <nuria>	 elukey: +1 to report, looks good. will just add ticket to keep track of backfilling
[16:00:21] <nuria>	 elukey: please share with ops team
[16:00:23] <elukey>	 super
[16:00:25] <elukey>	 thanks!
[16:02:29] <wikibugs>	 Analytics, Analytics-Kanban, Research, Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (Nuria)
[16:04:10] <wikibugs>	 Analytics, Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (Nuria)
[16:05:00] <wikibugs>	 Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Nuria) Added request for more brokers to our har...
[16:14:32] <wikibugs>	 Analytics, Analytics-Kanban, Wikimedia-Incident: attempt to backfill eventlogging data from eventlogging-client-side topic into per schema topics - https://phabricator.wikimedia.org/T220421 (Nuria)
[16:18:59] <nuria>	 HaeB: friendly reminder that this code patch is not merged cause it needs changes: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/464915/
[16:27:43] <wikibugs>	 Analytics, EventBus, Growth-Team, MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (kostajh) a:Pchelolo Yay, this works! Not sure what changed with the infrastructure, but today I was able to clear my 1,82...
[16:35:47] <wikibugs>	 Analytics, Pageviews-API, Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (MusikAnimal) >>! In T219857#5085894, @fdans wrote: > @MusikAnimal is this report coming from users getting 429? Are you getting the errors yourself?   Yes...
[16:37:45] <wikibugs>	 (CR) Nuria: [C: +1] Update mediawiki-history per-page and per-editor [analytics/aqs] - https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910) (owner: Joal)
[16:50:34] <wikibugs>	 Analytics, Pageviews-API, Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (Nuria) >So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the fir...
[16:50:52] <nuria>	 musikanimal: ping me if my reply here does not make sense: https://phabricator.wikimedia.org/T219857
[16:54:17] <ottomata>	 elukey:  great incident report
[16:54:21] <ottomata>	 thank you so much for taking care of that
[16:54:23] <ottomata>	 you are the best
[16:55:56] <elukey>	 ottomata: <3 nuria also added a lot of data
[17:02:05] <wikibugs>	 Analytics, Pageviews-API, Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (MusikAnimal) >>! In T219857#5094410, @Nuria wrote: >>So on the second try, the API can serve from cache for most of the pages. It only has to pull from st...
[17:21:06] * elukey off!
[17:43:47] <wikibugs>	 Analytics, Operations, netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (ayounsi) Open→Resolved Done, please reopen if any issue.
[17:43:50] <wikibugs>	 Analytics, Discovery, Operations, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (ayounsi)
[17:58:36] <wikibugs>	 Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (Tbayer) That link looks great overall. There seems to be a one-day discrepancy though between the dates given on the x-axis and in the mouseover.   A...
[18:54:51] <wikibugs>	 Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (Nuria) >What is the update frequency going to be - hourly? Probably daily
[18:54:58] <aharoni>	 hallo
[18:55:15] <aharoni>	 If I have an existing EventLogging schema, which correctly logs events,
[18:55:22] <aharoni>	 and I update the Schema page on Meta,
[18:55:46] <aharoni>	 and I update the extension code that logs to this schema with the correct revision ID,
[18:56:09] <aharoni>	 and I update the revision ID in this EventLoggingSchemas section in extension.json,
[18:56:38] <aharoni>	 is there anything else I have to do so that events will be logged correctly according to the new schema?
[18:57:04] <ottomata>	 that sounds like it aharoni 
[18:57:20] <aharoni>	 OK, just confirming: will a new database table be created under the log database?
[18:57:23] <nuria>	 aharoni: please make sure they validate in beta: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
[18:57:24] <ottomata>	 assuming 'update extension code' is also sending the event in the new schema format
[18:57:31] <ottomata>	 ah, aharoni  to get data into mysql
[18:57:33] <ottomata>	 we have to add to whitelist
[18:57:37] <ottomata>	 hive is on by default
[18:57:45] <ottomata>	 but mysql just isn't scaling so we only turn on via whitelist there
[18:57:55] <aharoni>	 the current schema is on mysql.
[18:57:58] <ottomata>	 ok
[18:58:01] <ottomata>	 then ya it should
[18:58:06] <nuria>	 aharoni: your data will be on hive  as well
[18:58:10] <aharoni>	 do I have to request this update, or will it be autocreated?
[18:58:14] <ottomata>	 autocreated
[18:58:24] <aharoni>	 great, thank you ottomata  and nuria!
[18:58:30] <ottomata>	 yup! :)
[19:00:09] <nuria>	 aharoni: you might want to get used to looking at your data in hive
[19:00:34] <nuria>	 aharoni: mysql database will be deprecated eventually as it cannot scale
[19:00:51] <aharoni>	 yes, I'm seriously considering it as the next step.
[19:03:41] <wikibugs>	 Analytics, Operations, netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (dr0ptp4kt) Thanks. Confirmed it works.
[19:09:36] <wikibugs>	 Analytics, Discovery, Operations, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (CDanis) So it sounds like the firewall work is done (thanks Arzhel!)  Seems like the next thing is to create a Swift container for this usage -- and maybe one just fo...
[19:57:06] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[20:00:19] <wikibugs>	 Analytics, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Doing), and 3 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (dduvall...
[20:22:40] <GoranSM>	 Hi there
[20:23:45] <GoranSM>	 Does anyone know under what conditions beeline produces "No current connection"? I am making beeline calls from an R script running from crontab on stat1007.
[20:28:38] <HaeB>	 nuria: yes, i know, the trailing spaces :-/ it's been on my todo list. but i'm kind of a gerrit newbie and haven't amended patches before... 
[20:28:56] <HaeB>	 ... i have been following the instructions at https://www.mediawiki.org/wiki/Gerrit/Tutorial#Amending_a_change_(your_own_or_someone_else's)
[20:29:21] <nuria>	 HaeB: you can ask for help from one of the developers on reading web, cc jdlrobson 
[20:29:45] <HaeB>	 ...but they result in git asking whether i want to resubmit all 100s of patches since the original one
[20:30:01] <HaeB>	 yes, will try to grab someone in the office
[20:59:38] <nuria>	 HaeB: also bearloga can help
[21:02:29] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[21:09:08] <wikibugs>	 Analytics, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Done with CPT), and 2 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (1...
[21:10:07] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[21:11:25] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[21:26:27] <wikibugs>	 Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (dr0ptp4kt) Hi, I'm requesting access to gpu-testers as well in order t...
[21:28:03] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[21:36:55] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[21:43:26] <wikibugs>	 Analytics, Analytics-Data-Quality, Product-Analytics: Many small wikis missing from mediawiki_history dataset - https://phabricator.wikimedia.org/T220456 (Neil_P._Quinn_WMF)
[21:44:37] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[21:45:04] <wikibugs>	 Analytics, Analytics-Data-Quality, Product-Analytics: Many small wikis missing from mediawiki_history dataset - https://phabricator.wikimedia.org/T220456 (Neil_P._Quinn_WMF)
[21:45:07] <wikibugs>	 Analytics-Kanban, Product-Analytics: Address data quality issues in the mediawiki_history dataset - https://phabricator.wikimedia.org/T204953 (Neil_P._Quinn_WMF)
[22:30:39] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[22:45:33] <wikibugs>	 Analytics, ChangeProp, Community-Tech, EventBus, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (aezell)
[22:47:15] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[23:07:41] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[23:12:49] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[23:50:05] <icinga-wm>	 RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[23:55:17] <icinga-wm>	 PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus