[00:55:00] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:58:08] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:03:54] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:05:16] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/page
[01:05:16] e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:06:52] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:07:28] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Ge
[01:07:28] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:10:36] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:11:26] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:16:18] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Ge
[01:16:18] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:30] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITIC
[01:17:30] article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:32] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Ge
[01:17:32] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:18:24] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/page
[01:18:24] e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:26:32] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRIT
[01:26:32] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:33:42] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:34:32] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:35:44] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:36:16] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
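The endpoints named in these checks are fronted by the public REST API, so a failing check can be reproduced by hand from any host. A minimal sketch, assuming the public wikimedia.org REST gateway routes to AQS; the article and date range are made up for illustration:

    # Spot-check the per-article pageviews endpoint the health checks exercise;
    # prints only the HTTP status code (200 = OK above; 500 or a timeout = CRITICAL).
    curl -s -o /dev/null -w '%{http_code}\n' \
      'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Main_Page/daily/20210113/20210119'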
[01:36:52] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:12:10] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimed
[02:12:10] ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:15:18] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:23:04] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/me
[02:23:04] file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:24:00] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wi
[02:24:00] /Services/Monitoring/aqs
[02:26:22] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:29:54] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per
[02:29:54] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:33:08] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:33:56] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:25:41] (PS1) Lex Nasser: Create pageviews 'top-per-country' endpoint with tests [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171)
[05:30:07] (CR) Lex Nasser: "This has been tested locally with SQLite backend, but not with Cassandra. Will do further testing before merging. Please let me know if yo" [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[06:44:34] good morning!
[06:44:46] AQS had fun during the night :(
[06:48:28] there was a big jump in traffic before the event
[06:48:29] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1&from=now-12h&to=now
[06:48:39] but not right before
[06:54:08] https://grafana.wikimedia.org/d/000000417/cassandra-system?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=aqs&var-server=All&var-disk=sda&var-disk=sdb&var-disk=sdc&var-disk=sdd&var-disk=sde
[06:55:10] so my early idea is that Cassandra was able to sustain all that traffic for some hours, until it reached its limit and timeouts started to happen
[06:56:25] IIRC we don't really have any access log for AQS, so it is difficult to figure out what it was; I suspect a bot
[07:08:50] another interesting graph is
[07:08:51] https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_pageviews_per_project&var-table=data&var-quantile=99p
[07:09:05] so read latency was very high for the whole time (hours)
[07:09:12] but there is a big peak of write requests
[07:10:23] I am wondering if the daily oozie load jobs kicked in, causing the already delicate state of Cassandra to finally collapse
[07:21:13] https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=15&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_pageviews_per_project&var-table=data&var-quantile=99p
[07:23:50] it matches the alarms
[07:24:28] even if there was a second burst as well
[07:27:01] first wave (UTC) 00:55 -> 01:36
[07:27:16] second wave (UTC) 02:12 -> 02:33
[07:27:23] (hope I got them right)
[07:27:34] the peak in writes is within the first wave
[07:27:42] but then why did it collapse again?
[07:27:48] There was a lot of traffic anyway
[07:28:08] ah wait, a lot of compactions happening as well
[07:29:47] ok, I think that for the moment I can re-run the oozie jobs
[07:33:00] ERROR [MessagingService-Incoming-/10.64.48.149] 2021-01-20 01:29:02,639 CassandraDaemon.java:185 - Exception in thread Thread[MessagingSe
[07:33:03] rvice-Incoming-/10.64.48.149,5,main]
[07:33:06] java.lang.OutOfMemoryError: Java heap space
[07:33:07] sigh
[07:34:41] all right, it is way simpler
[07:34:43] https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=28&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_editors_bycountry&var-table=data&var-quantile=99p&from=now-14d&to=now
[07:35:29] so we have been using almost all the heap for a long time (16G at the moment), and when we went under pressure Cassandra went OOM and some instances restarted
[07:36:46] we need to bump it, probably to 24G
[07:37:50] ah no wait, of course, we have 2 instances per host
[07:39:06] so on aqs1004-6 it will be a problem, we have only 64G of RAM
[07:39:18] on the other nodes we have 128G, so it shouldn't be an issue
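Given an OOM like the one in that log excerpt, the usual checks are the configured heap ceiling and how close each instance runs to it. A hedged sketch, assuming a multi-instance layout where every Cassandra instance has its own config directory, log file, and JMX port (the exact paths and ports on the aqs hosts are assumptions here):

    # Find which instances hit the OOM (log path assumed).
    grep -l 'java.lang.OutOfMemoryError' /var/log/cassandra/*/system.log

    # Configured max heap per instance (config path assumed).
    grep -h -- '-Xmx' /etc/cassandra-*/jvm.options

    # Live used/total heap via nodetool, one JMX port per instance (port assumed).
    nodetool -p 7199 info | grep 'Heap Memory'

The "Heap Memory (MB)" line from nodetool info is the same used-vs-total figure that the Grafana heap panel linked above is charting.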
[07:51:35] all right, all failed coords restarted
[07:56:19] brb
[08:10:33] I am trying to see in turnilo if I get some clue
[08:11:52] I am filtering webrequest_128 for /api/rest_v1/metrics/pageviews/*
[08:13:46] so there is a python-requests/2.22.0 UA that matches the rise in traffic
[08:29:10] https://w.wiki/uzb
[08:31:13] I created https://gerrit.wikimedia.org/r/657288
[08:31:29] to avoid bots without a clear UA hitting AQS
[08:55:04] ah weird, jobs failing again, I was probably too aggressive with restarts
[09:10:18] Analytics: Fix the remaining bugs open for Hue next - https://phabricator.wikimedia.org/T264896 (elukey)
[09:50:24] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (ema) >>! In T271953#6755027, @elukey wrote: > @ema quick question - is the client src port something that we could pass from ATS-TLS to Varnish frontend? Via HTTP heade...
[10:15:18] elukey: I had one comment on that UA/AQS patch
[10:19:49] klausman: morning! yep, answered - thanks a lot for following up, any suggestion is welcome
[10:20:06] I am about to send an email to alert@ with a summary of my understanding of this mess
[10:23:07] klausman: sent!
[10:27:58] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (elukey) Reporting a chat with the Traffic team that happened this morning: - https://gerrit.wikimedia.org/r/657296 is needed since ATS-TLS doesn't really add any content to...
[10:50:44] Analytics: Druid datasource drop triggers segment reshuffling by the coordinator - https://phabricator.wikimedia.org/T270173 (elukey) I followed up on my own thread (sent a looong time ago) on druid's user@ mailing list, let's see if anybody comes back with some suggestion..
[10:52:25] ok, follow-up on present/past outages done, aaand my morning is gone :(
[10:52:58] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (ema) >>! In T271953#6761141, @elukey wrote: > - In Varnish we'll need to add VCL code to set the new parameters to X-Analytics, so that Varnishkafka will pick them up a...
[11:39:41] * elukey lunch!
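The per-UA breakdown done in Turnilo can be approximated directly against the raw webrequest data in Hive. A hedged sketch, assuming the standard wmf.webrequest table layout and an arbitrary hour partition (the partition values are illustrative):

    # Rank user agents hitting the pageviews REST endpoints during one hour of the outage.
    hive -e "
      SELECT user_agent, COUNT(*) AS requests
      FROM wmf.webrequest
      WHERE webrequest_source = 'text'
        AND year = 2021 AND month = 1 AND day = 20 AND hour = 1
        AND uri_path LIKE '/api/rest_v1/metrics/pageviews/%'
      GROUP BY user_agent
      ORDER BY requests DESC
      LIMIT 20;
    "

A single UA such as python-requests/2.22.0 dominating the top of this list is the signature of the bot traffic suspected above.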
[12:12:32] (PS1) Arturo Borrero Gonzalez: refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400)
[13:21:04] (CR) Joal: "Thanks for the review mforns :)" (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[13:22:38] (PS2) Joal: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560)
[13:26:10] (PS2) Joal: Change DataFrameToDruid base temporary path [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560)
[13:26:34] (CR) Joal: "Path changed to /wmf/tmp/druid" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[13:32:53] <- late lunch and some groceries, back in 15 or so.
[13:45:13] I updated the various patches I had ongoing to fix the /tmp folder permissions - I used /wmf/tmp/analytics and /wmf/tmp/druid - Hopefully we'll deploy that tonight :)
[13:48:04] going to check them in a bit :)
[13:48:19] thanks elukey - I hope the folder choice is ok
[13:49:31] yep yep I like it a lot, thanks for the patience :)
[13:50:06] no problemo elukey - thanks for raising the concern :)
[13:54:11] Analytics, Analytics-Kanban, Anti-Harassment, Event-Platform, and 2 others: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 (Ottomata)
[13:57:22] Analytics, Event-Platform, Language-analytics, MW-1.36-notes (1.36.0-wmf.27; 2021-01-19): UniversalLanguageSelector Event Platform Migration - https://phabricator.wikimedia.org/T267352 (Ottomata) Status: Waiting for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+...
[13:59:54] mforns: o/ yt?
[14:02:47] gooood morning
[14:06:39] morning!
[14:07:25] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) @Cmjohnson before racking the remaining 6 nodes (that we can do in another task) could you check an-worker1119 and an-worker1131 to see if they
[14:14:30] (PS1) Ottomata: Add leggacy quicksurveyinitiation and quicksurveysresponses schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657341 (https://phabricator.wikimedia.org/T271165)
[14:19:30] (PS2) Ottomata: Add leggacy quicksurveyinitiation and quicksurveysresponses schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657341 (https://phabricator.wikimedia.org/T271165)
[14:20:22] (CR) Ottomata: [C: +2] Add leggacy quicksurveyinitiation and quicksurveysresponses schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657341 (https://phabricator.wikimedia.org/T271165) (owner: Ottomata)
[15:01:19] (PS2) Milimetric: refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400) (owner: Arturo Borrero Gonzalez)
[15:01:25] (CR) Milimetric: [C: +2] refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400) (owner: Arturo Borrero Gonzalez)
[15:07:33] (Merged) jenkins-bot: refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400) (owner: Arturo Borrero Gonzalez)
[15:18:45] heya ottomata :] I'm here
[15:19:03] heya, wanted to sync on the status of the migration stuff; also I missed a ping of yours yesterday
[15:19:28] yes, sorry I didn't follow up yesterday
[15:19:32] let's bc?
[15:20:24] 2 mins
[15:20:26] k
[15:39:03] Analytics, Analytics-EventLogging, Event-Platform, Performance-Team: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (Ottomata)
[15:39:09] Analytics-Data-Quality, VisualEditor, WMDE-TechWish: Investigate missing dialog close events - https://phabricator.wikimedia.org/T272020 (awight)
[15:41:14] Analytics, Analytics-EventLogging, Event-Platform, Performance-Team: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (Ottomata) FYI we plan to migrate these schemas during the week of January 25 - 29.
[15:51:54] joal (for later): I noticed refinery-source now has /wmf/tmp/druid and refinery is doing /tmp_data_transfer/*. I will update the refinery change to /wmf/tmp/{druid,analytics} and commit/deploy/restart
[15:52:33] I'll also follow up on Marcel's comments
[15:59:41] Analytics, Event-Platform, Language-analytics, MW-1.36-notes (1.36.0-wmf.27; 2021-01-19): UniversalLanguageSelector Event Platform Migration - https://phabricator.wikimedia.org/T267352 (Ottomata)
[16:05:48] ottomata: gave a +2 to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/QuickSurveys/+/657355
[16:10:50] thanks mforns
[16:26:41] Heya
[16:26:47] milimetric: Hi!
[16:27:04] o/
[16:27:41] milimetric: I hope you've not started patching refinery - I provided a v2 version of that with the correct patch :S
[16:29:07] ottomata: es-internal deployed in staging!
[16:29:18] milimetric: I have updated my 4 patches (2 puppet, 1 refinery, 1 refinery-source)
[16:29:31] sorry if it proceeds super slowly, but I'm a n00b and I'm updating the docs as I go
[16:29:42] oh ok, I missed that, it's ok, I'll merge and start deploying then?
[16:31:27] milimetric: I still have to patch puppet again to change the event-sanitized job's refinery-source version, and then it'll be oozie restarts
[16:31:48] milimetric: let's do that after standup, so that we can concentrate?
[16:31:51] elukey: OhOoboy!
[16:31:53] joal: ok, let me do the oozie restarts though
[16:32:17] milimetric: we can organize, you taking some of it will be great, yes :) Thanks :)
[16:32:42] milimetric: there are also FS operations, to make perms coherent
[16:33:05] yep
[16:33:40] order should be deploy / kill jobs / change perms / merge puppet / start jobs
[16:34:51] milimetric: I'd actually go for: merge, kill-restart gently, possibly with some backfilling to triple check newly generated data, and finally change perms on already existing data
[16:35:14] milimetric: and actually, puppet changes first (hdfs folders creation)
[16:37:06] (CR) Milimetric: [C: +2] Change DataFrameToDruid base temporary path [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:39:31] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata)
[16:39:50] joal: for the HDFS paths - should we also make sure that puppet creates /wmf/tmp, or is it handled automatically? I didn't check the puppet code we have
[16:40:09] elukey: puppet code does mkdir -p so we should be good
[16:41:40] perfect
[16:41:58] Analytics, Data-Persistence-Backup: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (razzi) @jcrespo It looks like this is normal - traffic to wikimediafoundation.org has spiked since the 20th birthday last week, so the access logs...
[16:43:15] joal: I found something small, related to sqoop. Will send a follow-up patch to your refinery change and one additional puppet change
[16:44:22] ack milimetric thanks!
[16:44:34] Analytics, Data-Persistence-Backup: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (jcrespo) Open→Resolved a: razzi Cool thanks. I initially filed this because I had misinterpreted the data as the data shrinking (not g...
[16:47:42] finally joal, I noticed you didn't change this one - is it special for some reason or should I change it too? https://gerrit.wikimedia.org/g/analytics/refinery/+/e7916337e11352f6d79e9fe0b9d0f268ff9f7577/oozie/article_recommender/coordinator.properties#71
[16:48:18] milimetric: this code is dead - it never ran and nobody maintains it - that's why I didn't change it - we can do it if you wish, but I'd rather delete it
[16:48:32] joal: ok, deleting the code then
[16:48:50] Thanks for that milimetric - it's probably worth its own patch
[16:49:03] will do in a follow-up
[16:49:09] milimetric: this code belongs to research, we should follow up with them before acting (maybe?)
[16:49:42] it's easy enough to undo, better to ask forgiveness
[16:50:00] ack milimetric - go for it
[16:51:40] note for later: I want to follow up on the Big Top migration
[16:52:47] (PS3) Milimetric: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:55:12] (PS1) Milimetric: Delete article recommender job [analytics/refinery] - https://gerrit.wikimedia.org/r/657369
[16:55:18] elukey@an-launcher1002:~$ ls -ld /mnt/hdfs/wmf/tmp/
[16:55:18] drwxr-x--- 2 hdfs hadoop 4096 Jan 20 16:53 /mnt/hdfs/wmf/tmp/
[16:55:20] (CR) Joal: Update oozie jobs tmp folders for ownership/perms (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:55:20] joal: --^
[16:55:31] tltaylor: \o/
[16:55:50] tltaylor: sorry for the ping
[16:55:58] no worries
[16:56:02] joal: we need to explicitly create the dir with different perms, otherwise the subdirs will not be accessible
[16:56:29] makes sense elukey
[16:57:04] (PS4) Milimetric: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:57:39] elukey: we wish the /wmf/tmp folder to be o+rx, right?
[16:58:01] (CR) Milimetric: [V: +2 C: +2] Update oozie jobs tmp folders for ownership/perms (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:58:21] (PS2) Milimetric: Delete article recommender job [analytics/refinery] - https://gerrit.wikimedia.org/r/657369
[16:58:24] joal: yes, or maybe something like analytics:analytics, so the analytics user will be able to access it
[16:58:29] (CR) Milimetric: [V: +2 C: +2] Delete article recommender job [analytics/refinery] - https://gerrit.wikimedia.org/r/657369 (owner: Milimetric)
[16:58:39] joal: ah no, druid also needs it
[16:58:41] actually elukey, druid should also be able
[16:58:42] so yes, other
[16:58:43] right
[16:58:45] yes yes sorry
[16:58:46] ok
[16:58:50] Updating my patch
[16:58:57] actually, sending a new patch
[16:59:09] joal: can you also add an explicit dependency with require?
[16:59:21] in the existing dir creations, something like
[16:59:37] require => Cdh::etc..
[16:59:54] so /wmf/tmp will be created first etc..
[17:00:03] Will try :)
[17:00:06] lemme give you the exact require, one sec
[17:02:13] require => Cdh::Hadoop::Directory['/wmf/tmp']
[17:02:20] it should work fine
[17:02:29] puppet does it for File resources but this is different
[17:02:35] *does it automatically
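The scheme being settled on here - a world-traversable /wmf/tmp parent with per-user subdirectories - translates into a handful of HDFS operations. A hedged sketch of the manual equivalent, assuming the hdfs superuser and the analytics/druid ownership split discussed above (the druid group name is an assumption):

    # Parent needs o+rx so the analytics and druid users can traverse into their subdirs.
    sudo -u hdfs hdfs dfs -mkdir -p /wmf/tmp
    sudo -u hdfs hdfs dfs -chmod 755 /wmf/tmp

    # Per-user scratch areas, each restricted to its owner.
    sudo -u hdfs hdfs dfs -mkdir -p /wmf/tmp/analytics /wmf/tmp/druid
    sudo -u hdfs hdfs dfs -chown analytics:analytics /wmf/tmp/analytics
    sudo -u hdfs hdfs dfs -chown druid:druid /wmf/tmp/druid
    sudo -u hdfs hdfs dfs -chmod 750 /wmf/tmp/analytics /wmf/tmp/druid

In production this is what the Cdh::Hadoop::Directory puppet resources encode; the explicit require mentioned above just guarantees /wmf/tmp is created before its children.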
[17:02:58] mforns: FYI I updated https://phabricator.wikimedia.org/T259163 with instructions on how to test server-side events using the PHP repl
[17:03:10] mwscript shell.php --wiki testwiki
[17:03:14] ottomata: ok thanks!
[17:04:32] fdans: standup!
[17:23:48] joal: when you have time we can discuss the questions on the session length spreadsheet
[17:23:52] a-team: We're going for a deploy of refin
[17:24:07] ok
[17:24:30] ery-source first,
[17:29:44] Analytics, Analytics-Kanban, Event-Platform, EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) Got up to deploying the service in staging, it seems to be working! Upd
[17:37:52] ottomata, mforns - anything more than the patches for perms and refine-sanitize to be deployed for you folks?
[17:38:07] joal: nothing on my side
[17:38:12] ack - thanks mforns
[17:38:45] mforns: Will you have some time later to talk about sampling?
[17:38:55] joal: of course!
[17:38:59] cool :)
[17:39:08] joal: I have a meeting at 20h, but before or after is OK
[17:39:14] ack!
[17:47:00] ottomata: ping again --^
[17:50:01] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) With @ottomata we came up with a way to roll back a partition migration. When applying a migration, it prints the current state, which can be used to migr...
[17:50:14] ok nothing for Andrew
[17:50:49] Therefore: merging patches in refinery-source
[17:51:54] (CR) Joal: [C: +2] "Merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[17:53:03] (CR) Joal: [C: +2] "Merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657171 (https://phabricator.wikimedia.org/T272177) (owner: Joal)
[17:54:14] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (Ottomata) FYI, the controller bounce idea we got from https://users.kafka.apache.narkive.com/epBsWAPC/stuck-re-balance
[17:54:26] milimetric: I updated the deployment etherpad - https://etherpad.wikimedia.org/p/analytics-weekly-train
[17:59:04] (Merged) jenkins-bot: Fix DataFrameExtension.convertToSchema repartition [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657171 (https://phabricator.wikimedia.org/T272177) (owner: Joal)
[18:02:09] milimetric: We need a new patch for refinery before deploy - it's coming
[18:02:48] joal: ok, I'm in an interview for an hour anyway so no rush
[18:02:57] ack milimetric
[18:08:17] (PS1) Joal: Update geoeditors monthly jar version for wmcs IPs [analytics/refinery] - https://gerrit.wikimedia.org/r/657383 (https://phabricator.wikimedia.org/T272400)
[18:09:24] milimetric: that patch --^ requires us to deploy refinery-source before refinery
[18:09:53] Ok, deploying refinery-source
[18:11:32] !log Release refinery-source v0.0.144 to archiva with Jenkins
[18:11:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:37] (PS1) Joal: Bump changelog.md for v0.0.144 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657384
[18:14:21] (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657384 (owner: Joal)
[18:15:24] Starting build #67 for job analytics-refinery-maven-release-docker
[18:28:19] Project analytics-refinery-maven-release-docker build #67: SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/67/
[18:35:48] Starting build #34 for job analytics-refinery-update-jars-docker
[18:36:12] (PS1) Maven-release-user: Add refinery-source jars for v0.0.144 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/657387
[18:36:12] Project analytics-refinery-update-jars-docker build #34: SUCCESS in 24 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/34/
[18:36:21] hey mforns - do we take a few minutes now?
[18:36:29] joal: yes!
[18:36:32] bc?
[18:36:34] mforns: tardis!
[18:36:38] k
[18:43:33] joal: qq - we decided to go with replication factor 2 for the backup cluster, right?
[18:43:50] we did indeed elukey
[18:44:30] ack perfect
[18:44:38] the config looks good then, tomorrow I'll merge
[18:44:51] awesome elukey
[18:45:41] razzi, ottomata - going afk if I am not needed
[18:46:30] elukey: sounds good, this migration is on track to finish in 20 minutes and we'll be working on the plan for the rest
[18:46:59] razzi: perfect, nice job :)
[18:48:34] * elukey afk!
[19:02:11] (CR) Joal: [V: +2 C: +2] "Merging for later deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/657387 (owner: Maven-release-user)
[19:14:11] Analytics-Radar, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (Cmjohnson) Open→Resolved a: Cmjohnson @elukey I swapped the SSD. The only spare I had is 300GB. It's new. Feel free to do what you need. I am resolving this t...
[19:14:59] ok milimetric - refinery is ready to be deployed, I think
[19:15:56] milimetric: In addition to the restarts for perms, there is geoeditors-monthly to restart, to pick up the updated hive jar with the WMCS IP changes (documented on the train etherpad)
[19:23:16] Pchelolo: milimetric FYI: https://phabricator.wikimedia.org/T120242#6763163
[19:23:17] :)
[19:23:28] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata) FYI, I had a chat with @Krinkle yesterday, and he informed me that for all MediaWiki browser client generated writes to MediaWiki...
[19:25:59] ottomata: are you still in ops with razzi or can I take your time for a minute?
[19:26:33] joal: we split as the migration continued; now it's finished :)
[19:26:57] Thanks for the update razzi :)
[20:02:16] Analytics, Data-release, Privacy Engineering, Research, Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (Nuria) Parking some thoughts from my conversation with @Isaac after his good work this pa...
[20:21:14] Gone for tonight team - see y'all tomorrow
[20:39:08] joal AHHH
[20:39:12] I missed your ping!!!
[20:39:14] sorry!
[20:39:24] (nice razzi!)
[20:49:21] Analytics, Data-release, Privacy Engineering, Research, Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (TedTed) @Isaac, that's an amazing demo! I love this, thanks for your work and thoughtful...
[21:04:39] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) Migrated the following topics on kafka-jumbo: codfw.mediawiki.revision-create eqiad.mediawiki.revision-create The migrations still to be run are on kafk
[21:18:31] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) One more useful command: to change the throttle rate, run it on the node the data is coming from and the node the data is going to. For example, if data is b
[21:19:17] (PS1) Milimetric: Update country blacklist based on latest reports [analytics/refinery] - https://gerrit.wikimedia.org/r/657413
[21:20:26] (CR) Milimetric: [V: +2 C: +2] "144 deployed, merging this" [analytics/refinery] - https://gerrit.wikimedia.org/r/657383 (https://phabricator.wikimedia.org/T272400) (owner: Joal)
[21:21:07] (CR) Milimetric: [V: +2 C: +2] Update country blacklist based on latest reports [analytics/refinery] - https://gerrit.wikimedia.org/r/657413 (owner: Milimetric)
[21:48:58] !log refinery deployed, synced to hdfs, ready to restart 53 oozie jobs, will do so slowly over the next few hours
[21:49:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:33:34] milimetric: the couple jobs that just fail, is this you?
[22:33:45] failed *
[23:45:37] fdans: I'm looking into it, it's probably me, yes, I've restarted like 40 jobs so far
[23:47:03] fdans: I don't see any failures, just SLA alerts (those fire when jobs are restarted, I sent an email ahead of all the spam)
[23:47:20] milimetric: yea my bad, they're totally sla alerts
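The restart procedure behind that !log entry (kill each running coordinator, then resubmit it from the freshly deployed refinery) maps onto the standard oozie CLI. A hedged sketch with a placeholder coordinator ID, properties path, and start time - the real jobs take more -D overrides:

    # Kill the running coordinator (ID is a placeholder).
    oozie job -oozie $OOZIE_URL -kill 0012345-210101000000000-oozie-oozi-C

    # Resubmit from the deployed refinery, starting where the old run left off
    # (properties file path and start_time are illustrative).
    oozie job -oozie $OOZIE_URL -run \
      -config /srv/deployment/analytics/refinery/oozie/pageview/hourly/coordinator.properties \
      -Dstart_time=2021-01-21T00:00Z

Restarting the 53 coordinators in small batches, as described above, keeps the SLA-alert spam and cluster load manageable; as the last exchange notes, those alerts fire on restart even when nothing is actually failing.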