[04:20:07] PROBLEM - Check the last execution of monitor_refine_eventlogging_eventbus on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_eventbus [06:19:34] hello! [06:21:07] so each time in the logs on an-coord1001 I can see [06:21:08] task error: Container [pid=17784,containerID=container_e01_1553764233554_46256_01_000003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 6.1 GB of 4.2 GB virtual memory used. Killing container. [06:21:27] that makes sense, by default the map xmx limit is 1638 MB [06:22:29] example of map failure [06:22:30] https://yarn.wikimedia.org/jobhistory/tasks/job_1553764233554_46256/m [06:29:09] mmmm and it doesn't seem to be a single topic [07:00:40] wow so I have raised limits in camus' conf.. fewer map kills but still [07:00:43] is running beyond physical memory limits. Current usage: 4.0 GB of 4 GB [07:07:00] Hi elukey [07:07:35] elukey: this feels weird [07:08:07] joal: I am trying with 8192 as limit, seems not killing this time [07:08:11] bonjour :) [07:08:44] elukey: I'm wondering about the possible reasons ... huge message? too many topics? [07:08:54] elukey: and also, why now? [07:09:35] I'd say huge messages, at some point the mappers have difficulties with the input data.. would it sound plausible? [07:10:18] yes, but man, a message needed to move from 2G to 8G ??? [07:11:12] no idea how the mapper does the job exactly, probably it reads a lot of big messages (say close to our kafka limit, 4 MB) on the haep? [07:11:15] PROBLEM - Check the last execution of camus-eventbus on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-eventbus [07:11:15] *heap? 
[07:11:21] it would quickly fill up [07:11:37] yeah that one is due to me executing camus manually [07:11:44] yeah, makes sense [07:11:45] I know icinga I am sorry [07:12:01] it is now stuck at 93% sigh [07:12:12] ah no slowly passed to 94 [07:12:21] Andrew is still not back [07:12:27] hahahah [07:12:29] elukey: IIRC camus is time-bound [07:12:34] :) [07:13:08] how are you? All good during the weekend? [07:14:18] All good thanks, I recovered over the weekend - fever was gone on Sunday [07:14:22] How about you? [07:14:54] good, so now we have 4 Josephs back! :D \o/ [07:15:12] huhu :) I'll do my best to try to fulfill the expectation ;) [07:16:02] all good! Didn't do much, but I slept a bit after the long week :D [07:18:41] camus done! [07:18:56] \o/ Thanks a lot elukey [07:19:36] elukey: how much lag do we need to cover? [07:20:17] I think 2/3 of a day [07:20:42] I had the feeling it was more - Great [07:21:31] RECOVERY - Check the last execution of camus-eventbus on an-coord1001 is OK: OK: Status of the systemd unit camus-eventbus [07:23:58] this was me manually resetting the unit [07:35:39] RECOVERY - Check the last execution of monitor_refine_eventlogging_eventbus on an-coord1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_eventbus [07:36:03] ah joal I wanted to show you https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501578/ [07:36:22] the idea is to create a new group called "analytics-deployers" to add people that can scap as analytics [07:36:32] atm it is analytics-admins, that is not really flexibly [07:36:35] *flexible [07:37:22] works for me elukey :) I know Andrew didn't want to complexify groups, but it was a while ago, and needs have changed [07:38:23] yeah I know, I'll have a chat with him today, but I think it is better than using analytics-admins (that now allows a ton of sudo perms) [07:38:46] brb! 
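The numbers in the YARN error earlier in the log (a 1638 MB map heap inside a 2 GB container, and a 4.2 GB virtual memory ceiling) line up with simple ratios. The sketch below reproduces that arithmetic. It assumes the common Hadoop defaults of a 0.8 heap-to-container ratio and a 2.1 virtual-to-physical memory ratio; neither value is stated in the log, and the property names in the comments are quoted from memory, so treat both as assumptions.

```python
# Sketch: reproduce the limits quoted in the YARN "running beyond physical
# memory limits" error. The two ratios are ASSUMED Hadoop defaults; they are
# not stated anywhere in the log.

HEAP_RATIO = 0.8       # assumed: mapreduce.job.heap.memory-mb.ratio
VMEM_PMEM_RATIO = 2.1  # assumed: yarn.nodemanager.vmem-pmem-ratio


def container_limits(container_mb: int) -> dict:
    """Derive the JVM heap and virtual-memory ceiling for a container size."""
    return {
        "container_mb": container_mb,
        # JVM heap (-Xmx) derived from the container size, truncated to MB.
        "heap_mb": int(container_mb * HEAP_RATIO),
        # Virtual-memory ceiling enforced by the NodeManager, in GB.
        "vmem_gb": round(container_mb / 1024 * VMEM_PMEM_RATIO, 1),
    }


# The three container sizes tried in the log: 2 GB, 4 GB and 8192 MB.
for size_mb in (2048, 4096, 8192):
    print(container_limits(size_mb))
```

Under these assumptions a 2048 MB container yields a 1638 MB heap and a 4.2 GB virtual limit, matching the figures in the error message. The container is killed on total process memory, not heap alone, which is why a mapper can exceed 2 GB while its heap is capped lower.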
[08:35:18] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [08:37:42] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [08:55:22] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:57:38] sigh [08:58:28] again our dear bot [08:58:48] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:00:56] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:00:56] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:03:41] need to go out for an errand, will be back in ~1:30 probably! 
[09:08:47] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:09:59] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:30:37] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:32:09] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:09] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:09] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:11] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:35:05] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:35:47] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:36:09] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:37:15] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:39:31] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:48:33] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:52:11] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:54:23] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:55:37] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:56:56] (03CR) 10Hoo man: [C: 03+2] Count number of Wikidata edits by namespace [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: 10Lucas Werkmeister (WMDE)) [09:57:17] 
(03Merged) 10jenkins-bot: Count number of Wikidata edits by namespace [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: 10Lucas Werkmeister (WMDE)) [09:57:20] (03Merged) 10jenkins-bot: Track number of links to Wikidata entity namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500966 (https://phabricator.wikimedia.org/T218903) (owner: 10Michael Große) [09:59:55] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:01:08] (03PS1) 10Hoo man: Count number of Wikidata edits by namespace [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/502169 (https://phabricator.wikimedia.org/T218901) [10:03:05] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:03:36] (03PS1) 10Hoo man: Track number of links to Wikidata entity namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/502179 (https://phabricator.wikimedia.org/T218903) [10:04:15] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:06:17] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:14:49] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:15:59] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:18:54] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:19:48] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:23:00] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:23:54] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:34:06] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:35:47] silencing the alarms [10:40:12] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:41:26] just done --^ [10:41:35] quick lunch then I'll be fully available [10:43:31] (03CR) 10Michael Große: [C: 03+1] "Not sure if there is much I can 
do here?" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/502179 (https://phabricator.wikimedia.org/T218903) (owner: 10Hoo man) [11:19:39] (03PS1) 10Joal: Update mediawiki-history per-page and per-editor [analytics/aqs] - 10https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910) [11:24:08] (03PS2) 10Joal: Update mediawiki-history per-page and per-editor [analytics/aqs] - 10https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910) [11:24:25] * joal is fed up by the AQS alarms ---^ [11:28:39] nuria: whenever you have time - https://phabricator.wikimedia.org/T220084 [11:29:15] I think this approval is kind of borderline for us (since it is wmde users working on analytics nodes) but better triple checking :) [11:35:50] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10JAllemandou) I did a quick analysis over request-patterns: 94% of edits-per-page requests made on April 4th were on a timespan of more than 1 year... [12:20:12] 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10fgiunchedi) Removing mailing-lists tag since work there is done. [12:21:53] 10Analytics, 10Operations: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10fgiunchedi) Removing mailing-lists tag since work there is done. [12:24:51] 10Analytics, 10WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (10jbond) [12:27:07] 10Analytics, 10WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (10jbond) p:05Triage→03Normal [12:31:41] 10Analytics, 10WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (10elukey) @jbond IIUC Adam will keep working with us as volunteer, not sure if we need to follow up with this task. 
What do you think? [12:52:07] 10Analytics, 10WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (10jbond) @elukey he is remaining as a volunteer so i agree this probably doesn't need an action. however im not familiar enough with HDFS/PPI stuff to know if there is a difference between the WMF and N... [13:07:54] 10Analytics, 10WMF-NDA-Requests: Check PPI leftovers - awight - https://phabricator.wikimedia.org/T220377 (10elukey) I am not sure either, let's check anyway :) @awight is any of the following needed and/or containing PII data that we can remove? ` ====== stat1004 ====== total 4 drwxrwxr-x 3 awight wikidev 4... [13:11:26] brb~ [13:43:15] is Andrew working today? [13:43:45] ¯\_(ツ)_/¯ [13:54:02] probably not, I wanted to ask some questions about the new cdh [13:57:51] 10Analytics, 10Analytics-Kanban, 10Core Platform Team (Modern Event Platform (TEC2)), 10Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Ottomata) > But you'd want to upgrade these syst... [13:59:55] 10Analytics, 10EventBus, 10Operations, 10serviceops, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (10Ottomata) [14:00:00] 10Analytics, 10EventBus, 10Operations, 10serviceops, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (10Ottomata) 05Open→03Resolved [14:00:32] elukey: looks like he's back --^ ;) [14:00:49] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (10Ottomata) Until we get rid of Camus, we can't restrict old client versions. Camus uses a 0.8 client. :/ [14:00:50] he is! 
[14:01:14] elukey: are you ok with me manually testing a kill-task for druid-public on an old datasource (a different approach from the one used by the script on Friday)? [14:01:40] elukey: Goal is to make sure this approach doesn't break brokers [14:02:06] ahh the kill task is to delete datasources [14:02:21] elukey: kill-task is to deep-delete segments [14:02:25] yes yes [14:02:38] elukey: when you delete all segments of a disabled datasource, it is considered deleted [14:02:53] elukey: Testing :) [14:03:30] yes I am aware of that, I wasn't getting the "kill" part :D [14:08:06] Analytics, EventBus, Operations, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (herron) [14:08:30] ottomata: o/ [14:08:33] hiiiiii [14:13:45] HIIII [14:13:47] o/ o/ o/ [14:13:53] sorry man emails are crazy hello! [14:14:09] :) [14:14:29] now that you are officially back we can assume that the outages are over [14:16:06] hahah [14:16:07] oh man [14:18:42] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (elukey) >>! In T219842#5093612, @Ottomata wrote:... [14:19:34] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) +1 sounds good. [14:20:02] hey everyone :) [14:20:20] technically we're having office hours right now! [14:20:36] but nobody's been showing up so we gotta market it more. 
I'll have to ping johan [14:22:22] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) Actually, this makes sense. I think E... [14:26:03] ottomata: it is a serious problem that we need to start taking into account when you go away for some days :D [14:26:19] last week it was a nightmare [14:26:32] looks like the 'POST kill task' put a lot less pressure on druid [14:29:17] (CR) Ottomata: Oozie: add article recommender (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov) [14:29:44] (CR) Ottomata: "I think merging this as is is fine! :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov) [14:29:49] (CR) Ottomata: [C: +2] Oozie: add article recommender [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov) [14:29:51] (CR) Ottomata: [V: +2 C: +2] Oozie: add article recommender [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov) [14:31:40] joal: nice! [14:32:12] elukey: I will refactor druid deletion accordingly as suggested in T220111 [14:32:13] T220111: Refactor druid data deletion script - https://phabricator.wikimedia.org/T220111 [14:34:38] super [14:37:27] hey team [14:37:32] omg alarms... 
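For readers unfamiliar with the kill task discussed above: in Druid it is a small JSON spec submitted to the Overlord, and it permanently removes segments that are already marked unused within the given interval. The sketch below only builds the payload and does not send anything. The datasource name and interval are hypothetical placeholders, not values from the log, and the actual deletion script referenced in T220111 may look quite different.

```python
import json


def build_kill_task(datasource: str, interval: str) -> dict:
    """Build a Druid kill-task spec.

    The interval is an ISO-8601 interval string; only segments already
    marked unused (for example, those of a disabled datasource) are
    affected by a kill task.
    """
    return {"type": "kill", "dataSource": datasource, "interval": interval}


# Hypothetical datasource and interval, for illustration only.
task = build_kill_task("example_old_datasource", "2015-01-01/2019-01-01")
print(json.dumps(task, indent=2))

# The spec would then be POSTed as JSON to the Overlord task endpoint,
# /druid/indexer/v1/task, on whichever host runs the Overlord.
```

Submitting one task per datasource and letting the Overlord schedule the deletion is consistent with the observation in the log that this approach put less pressure on Druid than the earlier script.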
[14:37:38] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 3 others: Modern Event Platform: Stream Intake Service: Implementation: Deployment Pipeline - https://phabricator.wikimedia.org/T211247 (10akosiaris) [14:37:40] 10Analytics, 10EventBus, 10Operations, 10Patch-For-Review, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10akosiaris) 05Open→03Stalled Stalling until we have some sane solution. [14:37:51] elukey, can I help w/ sth? [14:38:40] (03CR) 10Bmansurov: "Thanks for reviews and merging!" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: 10Bmansurov) [14:39:41] mforns: all handled! For aqs we are waiting on a restbase deploy, and camus seems behaving after this morning [14:39:55] oh wow, ok [14:40:43] 10Analytics, 10Research, 10Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (10bmansurov) [14:40:57] 10Analytics, 10Research, 10Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (10bmansurov) [14:40:59] 10Analytics, 10Research, 10Wikidata: Copy Wikidata dumps to HDFs - https://phabricator.wikimedia.org/T209655 (10bmansurov) [14:46:01] (03CR) 10Ottomata: Oozie: add article recommender (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: 10Bmansurov) [14:50:23] !log backfilling prefupdate schema into druid from Jan 1 2019 until Apr 1 2019 [14:50:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:51:05] (03CR) 10Milimetric: "Looks great, small nit on the max seconds used. 
I'm not sure about the usefulness of these endpoints if we can't look at full history, bu" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910) (owner: 10Joal) [14:59:11] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501715 (https://phabricator.wikimedia.org/T207280) (owner: 10HaeB) [15:04:02] 10Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10mforns) [15:21:00] 10Analytics, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Milimetric) The easiest thing to do is to delete the old data and change the schema going forward. Let me know if this is ok to do, @AndyRussG.... [15:28:40] 10Analytics: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (10Milimetric) 05Declined→03Open Thanks @Bawolff, I'll reopen this as it's enough of a pain for me and other workflows that it needs to be taken care of. Feel free to unsubscribe.... [15:33:00] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (10fdans) Data has been ingested up to Apr 1 00:00 (this means Apr 1 is not included). I'll add now the patch for puppet to load routinely. https://tur... [16:00:13] elukey: +1 to report, looks good. will just add ticket to keep track of backfilling [16:00:21] elukey: please share with ops team [16:00:23] super [16:00:25] thanks! 
[16:02:29] Analytics, Analytics-Kanban, Research, Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (Nuria) [16:04:10] Analytics, Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (Nuria) [16:05:00] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Nuria) Added request for more brokers to our har... [16:14:32] Analytics, Analytics-Kanban, Wikimedia-Incident: attempt to backfill eventlogging data from eventlogging-client-side topic into per schema topics - https://phabricator.wikimedia.org/T220421 (Nuria) [16:18:59] HaeB: friendly reminder that this code patch is not merged because it needs changes: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/464915/ [16:27:43] Analytics, EventBus, Growth-Team, MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (kostajh) a:Pchelolo Yay, this works! Not sure what changed with the infrastructure, but today I was able to clear my 1,82... [16:35:47] Analytics, Pageviews-API, Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (MusikAnimal) >>! In T219857#5085894, @fdans wrote: > @MusikAnimal is this report coming from users getting 429? Are you getting the errors yourself? Yes... 
[16:37:45] (03CR) 10Nuria: [C: 03+1] Update mediawiki-history per-page and per-editor [analytics/aqs] - 10https://gerrit.wikimedia.org/r/502198 (https://phabricator.wikimedia.org/T219910) (owner: 10Joal) [16:50:34] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10Nuria) >So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the fir... [16:50:52] musikanimal: ping me if my reply here does not make sense: https://phabricator.wikimedia.org/T219857 [16:54:17] elukey: great incident report [16:54:21] thank you so much for taking care of that [16:54:23] you are the best [16:55:56] ottomata: <3 nuria also added a lot of data [17:02:05] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10MusikAnimal) >>! In T219857#5094410, @Nuria wrote: >>So on the second try, the API can serve from cache for most of the pages. It only has to pull from st... [17:21:06] * elukey off! [17:43:47] 10Analytics, 10Operations, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10ayounsi) 05Open→03Resolved Done, please reopen if any issue. [17:43:50] 10Analytics, 10Discovery, 10Operations, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10ayounsi) [17:58:36] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (10Tbayer) That link looks great overall. There seems to be a one-day discrepancy though between the dates given on the x-axis and in the mouseover. A... 
[18:54:51] Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (Nuria) >What is the update frequency going to be - hourly? Probably daily [18:54:58] hallo [18:55:15] If I have an existing EventLogging schema, which correctly logs events, [18:55:22] and I update the Schema page on Meta, [18:55:46] and I update the extension code that logs to this schema with the correct revision ID, [18:56:09] and I update the revision ID in this EventLoggingSchemas section in extension.json, [18:56:38] is there anything else I have to do so that events will be logged correctly according to the new schema? [18:57:04] that sounds like it aharoni [18:57:20] OK, just confirming: will a new database table be created under the log database? [18:57:23] aharoni: please make sure they validate in beta: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster [18:57:24] assuming 'update extension code' is also sending the event in the new schema format [18:57:31] ah, aharoni to get data into mysql [18:57:33] we have to add to whitelist [18:57:37] hive is on by default [18:57:45] but mysql just isn't scaling so we only turn on via whitelist there [18:57:55] the current schema is on mysql. [18:57:58] ok [18:58:01] then ya it should [18:58:06] aharoni: your data will be on hive as well [18:58:10] do I have to request this update, or will it be autocreated? [18:58:14] autocreated [18:58:24] great, thank you ottomata and nuria! [18:58:30] yup! :) [19:00:09] aharoni: you might want to get used to looking at your data in hive [19:00:34] aharoni: mysql database will be deprecated eventually as it cannot scale [19:00:51] yes, I'm seriously considering it as the next step. [19:03:41] Analytics, Operations, netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (dr0ptp4kt) Thanks. Confirmed it works. 
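The schema update steps listed in the exchange above depend on one revision ID appearing consistently in two places: the EventLoggingSchemas section of extension.json and the extension code that logs the event. A minimal sanity check might look like the sketch below. The schema name, the revision numbers, and the helper functions are all hypothetical, and the inline JSON stands in for a real extension.json file; this is not part of EventLogging itself.

```python
import json

# Stand-in for an extension.json file. "ExampleSchema" and the revision
# number are hypothetical placeholders, not values from the log.
EXTENSION_JSON = """
{
  "EventLoggingSchemas": {
    "ExampleSchema": 12345678
  }
}
"""


def registered_revision(extension_json: str, schema: str) -> int:
    """Return the revision ID registered for a schema in extension.json."""
    return json.loads(extension_json)["EventLoggingSchemas"][schema]


def revisions_match(extension_json: str, schema: str, revision_in_code: int) -> bool:
    """Check that the logging code and extension.json agree on the revision."""
    return registered_revision(extension_json, schema) == revision_in_code


# Revision the (hypothetical) logging code claims to send.
print(revisions_match(EXTENSION_JSON, "ExampleSchema", 12345678))
```

A check like this only catches a mismatched revision ID. Whether the events actually validate against the new schema still has to be confirmed on the beta cluster, as suggested in the log.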
[19:09:36] Analytics, Discovery, Operations, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (CDanis) So it sounds like the firewall work is done (thanks Arzhel!) Seems like the next thing is to create a Swift container for this usage -- and maybe one just fo... [19:57:06] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [20:00:19] Analytics, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Doing), and 3 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (dduvall... [20:22:40] Hi there [20:23:45] Does anyone know under what conditions beeline produces "No current connection"? I am making beeline calls from an R script running on crontab, stat1007. [20:28:38] nuria: yes, i know, the trailing spaces :-/ it's been on my todo list. but i'm kind of a gerrit newbie and haven't amended patches before... [20:28:56] ... i have been following the instructions at https://www.mediawiki.org/wiki/Gerrit/Tutorial#Amending_a_change_(your_own_or_someone_else's) [20:29:21] HaeB: you can ask for help from one of the developers on reading web cc jdlrobson [20:29:45] ...but they result in git asking whether i want to resubmit all 100s of patches since the original one [20:30:01] yes, will try to grab someone in the office [20:59:38] HaeB: also bearloga can help [21:02:29] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. 
https://wikitech.wikimedia.org/wiki/EventBus [21:09:08] 10Analytics, 10EventBus, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Done with CPT), and 2 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (1... [21:10:07] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [21:11:25] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [21:26:27] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10dr0ptp4kt) Hi, I'm requesting access to gpu-testers as well in order t... [21:28:03] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [21:36:55] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [21:43:26] 10Analytics, 10Analytics-Data-Quality, 10Product-Analytics: Many small wikis missing from mediawiki_history dataset - https://phabricator.wikimedia.org/T220456 (10Neil_P._Quinn_WMF) [21:44:37] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. 
https://wikitech.wikimedia.org/wiki/EventBus [21:45:04] 10Analytics, 10Analytics-Data-Quality, 10Product-Analytics: Many small wikis missing from mediawiki_history dataset - https://phabricator.wikimedia.org/T220456 (10Neil_P._Quinn_WMF) [21:45:07] 10Analytics-Kanban, 10Product-Analytics: Address data quality issues in the mediawiki_history dataset - https://phabricator.wikimedia.org/T204953 (10Neil_P._Quinn_WMF) [22:30:39] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [22:45:33] 10Analytics, 10ChangeProp, 10Community-Tech, 10EventBus, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (10aezell) [22:47:15] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [23:07:41] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [23:12:49] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [23:50:05] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [23:55:17] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus