[00:55:00] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:58:08] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:03:54] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:05:16] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/page
[01:05:16] e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:06:52] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:07:28] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Ge
[01:07:28] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:10:36] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:11:26] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:16:18] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Ge
[01:16:18] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:30] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITIC
[01:17:30] article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:17:32] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Ge
[01:17:32] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:18:24] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/page
[01:18:24] e/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:26:32] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRIT
[01:26:32] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:33:42] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:34:32] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:35:44] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[01:36:16] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
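The endpoints named in these checks are fronted by the public REST API, so a failing check can be reproduced by hand from any host. A minimal sketch, assuming the public wikimedia.org REST gateway routes to AQS; the article and date range are made up for illustration:

    # Spot-check the per-article pageviews endpoint the health checks exercise;
    # prints only the HTTP status code (200 = OK above; 500 or a timeout = CRITICAL).
    curl -s -o /dev/null -w '%{http_code}\n' \
      'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Main_Page/daily/20210113/20210119'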
[01:36:52] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:12:10] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimed
[02:12:10] ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:15:18] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:23:04] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/me
[02:23:04] file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:24:00] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wi
[02:24:00] /Services/Monitoring/aqs
[02:26:22] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:29:54] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per
[02:29:54] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:33:08] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:33:56] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[05:25:41] (PS1) Lex Nasser: Create pageviews 'top-per-country' endpoint with tests [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171)
[05:30:07] (CR) Lex Nasser: "This has been tested locally with SQLite backend, but not with Cassandra. Will do further testing before merging. Please let me know if yo" [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[06:44:34] good morning!
[06:44:46] AQS had fun during the night :(
[06:48:28] there was a big jump in traffic before the event
[06:48:29] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1&from=now-12h&to=now
[06:48:39] but not right before
[06:54:08] https://grafana.wikimedia.org/d/000000417/cassandra-system?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=aqs&var-server=All&var-disk=sda&var-disk=sdb&var-disk=sdc&var-disk=sdd&var-disk=sde
[06:55:10] so my early idea is that Cassandra was able to sustain all that traffic for some hours, until it reached its limit and timeouts started to happen
[06:56:25] IIRC we don't really have any access log for AQS, so it is difficult to figure out what it was; I suspect a bot
[07:08:50] another interesting graph is
[07:08:51] https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_pageviews_per_project&var-table=data&var-quantile=99p
[07:09:05] so read latency was very high for the whole time (hours)
[07:09:12] but there is a big peak of write requests
[07:10:23] I am wondering if the daily oozie load jobs kicked in, causing the already delicate state of Cassandra to finally collapse
[07:21:13] https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=15&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_pageviews_per_project&var-table=data&var-quantile=99p
[07:23:50] it matches the alarms
[07:24:28] even if there was a second burst as well
[07:27:01] first wave (UTC) 00:55 -> 01:36
[07:27:16] second wave (UTC) 02:12 -> 02:33
[07:27:23] (hope I got them right)
[07:27:34] the peak in writes is within the first wave
[07:27:42] but then why did it collapse again?
[07:27:48] There was a lot of traffic anyway
[07:28:08] ah wait, a lot of compactions happening as well
[07:29:47] ok, I think that for the moment I can re-run the oozie jobs
[07:33:00] ERROR [MessagingService-Incoming-/10.64.48.149] 2021-01-20 01:29:02,639 CassandraDaemon.java:185 - Exception in thread Thread[MessagingSe
[07:33:03] rvice-Incoming-/10.64.48.149,5,main]
[07:33:06] java.lang.OutOfMemoryError: Java heap space
[07:33:07] sigh
[07:34:41] all right, it is way simpler
[07:34:43] https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=28&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_editors_bycountry&var-table=data&var-quantile=99p&from=now-14d&to=now
[07:35:29] so we have been using almost all the heap for a long time (16G at the moment), and when we went under pressure Cassandra went OOM and some instances restarted
[07:36:46] we need to bump it, probably to 24G
[07:37:50] ah no wait, of course, we have 2 instances per host
[07:39:06] so on aqs1004-6 it will be a problem, we have only 64G of RAM
[07:39:18] on the other nodes we have 128G, so it shouldn't be an issue
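Given an OOM like the one in that log excerpt, the usual checks are the configured heap ceiling and how close each instance runs to it. A hedged sketch, assuming a multi-instance layout where every Cassandra instance has its own config directory, log file, and JMX port (the exact paths and ports on the aqs hosts are assumptions here):

    # Find which instances hit the OOM (log path assumed).
    grep -l 'java.lang.OutOfMemoryError' /var/log/cassandra/*/system.log

    # Configured max heap per instance (config path assumed).
    grep -h -- '-Xmx' /etc/cassandra-*/jvm.options

    # Live used/total heap via nodetool, one JMX port per instance (port assumed).
    nodetool -p 7199 info | grep 'Heap Memory'

The "Heap Memory (MB)" line from nodetool info is the same used-vs-total figure that the Grafana heap panel linked above is charting.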
[07:51:35] all right, all failed coords restarted
[07:56:19] brb
[08:10:33] I am trying to see in turnilo if I get some clue
[08:11:52] I am filtering webrequest_128 for /api/rest_v1/metrics/pageviews/*
[08:13:46] so there is a python-requests/2.22.0 UA that matches the rise in traffic
[08:29:10] https://w.wiki/uzb
[08:31:13] I created https://gerrit.wikimedia.org/r/657288
[08:31:29] to avoid bots without a clear UA hitting AQS
[08:55:04] ah weird, jobs failing again, I was probably too aggressive with restarts
[09:10:18] Analytics: Fix the remaining bugs open for Hue next - https://phabricator.wikimedia.org/T264896 (elukey)
[09:50:24] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (ema) >>! In T271953#6755027, @elukey wrote: > @ema quick question - is the client src port something that we could pass from ATS-TLS to Varnish frontend? Via HTTP heade...
[10:15:18] elukey: I had one comment on that UA/AQS patch
[10:19:49] klausman: morning! yep, answered - thanks a lot for following up, any suggestion is welcome
[10:20:06] I am about to send an email to alert@ with a summary of my understanding of this mess
[10:23:07] klausman: sent!
[10:27:58] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (elukey) Reporting a chat with the Traffic team that happened this morning: - https://gerrit.wikimedia.org/r/657296 is needed since ATS-TLS doesn't really add any content to...
[10:50:44] Analytics: Druid datasource drop triggers segment reshuffling by the coordinator - https://phabricator.wikimedia.org/T270173 (elukey) I followed up on my own thread (sent a looong time ago) on druid's user@ mailing list, let's see if anybody comes back with some suggestion..
[10:52:25] ok, follow-up on present/past outages done, aaand my morning is gone :(
[10:52:58] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (ema) >>! In T271953#6761141, @elukey wrote: > - In Varnish we'll need to add VCL code to set the new parameters to X-Analytics, so that Varnishkafka will pick them up a...
[11:39:41] * elukey lunch!
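The per-UA breakdown done in Turnilo can be approximated directly against the raw webrequest data in Hive. A hedged sketch, assuming the standard wmf.webrequest table layout and an arbitrary hour partition (the partition values are illustrative):

    # Rank user agents hitting the pageviews REST endpoints during one hour of the outage.
    hive -e "
      SELECT user_agent, COUNT(*) AS requests
      FROM wmf.webrequest
      WHERE webrequest_source = 'text'
        AND year = 2021 AND month = 1 AND day = 20 AND hour = 1
        AND uri_path LIKE '/api/rest_v1/metrics/pageviews/%'
      GROUP BY user_agent
      ORDER BY requests DESC
      LIMIT 20;
    "

A single UA such as python-requests/2.22.0 dominating the top of this list is the signature of the bot traffic suspected above.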
[12:12:32] (PS1) Arturo Borrero Gonzalez: refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400)
[13:21:04] (CR) Joal: "Thanks for the review mforns :)" (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[13:22:38] (PS2) Joal: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560)
[13:26:10] (PS2) Joal: Change DataFrameToDruid base temporary path [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560)
[13:26:34] (CR) Joal: "Path changed to /wmf/tmp/druid" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[13:32:53] <- late lunch and some groceries, back in 15 or so.
[13:45:13] I updated the various patches I had ongoing to fix the /tmp folder permissions - I used /wmf/tmp/analytics and /wmf/tmp/druid - Hopefully we'll deploy that tonight :)
[13:48:04] going to check them in a bit :)
[13:48:19] thanks elukey - I hope the folder choice is ok
[13:49:31] yep yep I like it a lot, thanks for the patience :)
[13:50:06] no problemo elukey - thanks for raising the concern :)
[13:54:11] Analytics, Analytics-Kanban, Anti-Harassment, Event-Platform, and 2 others: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 (Ottomata)
[13:57:22] Analytics, Event-Platform, Language-analytics, MW-1.36-notes (1.36.0-wmf.27; 2021-01-19): UniversalLanguageSelector Event Platform Migration - https://phabricator.wikimedia.org/T267352 (Ottomata) Status: Waiting for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+...
[13:59:54] mforns: o/ yt?
[14:02:47] gooood morning
[14:06:39] morning!
[14:07:25] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) @Cmjohnson before racking the remaining 6 nodes (that we can do in another task) could you check an-worker1119 and an-worker1131 to see if they
[14:14:30] (PS1) Ottomata: Add leggacy quicksurveyinitiation and quicksurveysresponses schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657341 (https://phabricator.wikimedia.org/T271165)
[14:19:30] (PS2) Ottomata: Add leggacy quicksurveyinitiation and quicksurveysresponses schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657341 (https://phabricator.wikimedia.org/T271165)
[14:20:22] (CR) Ottomata: [C: +2] Add leggacy quicksurveyinitiation and quicksurveysresponses schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657341 (https://phabricator.wikimedia.org/T271165) (owner: Ottomata)
[15:01:19] (PS2) Milimetric: refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400) (owner: Arturo Borrero Gonzalez)
[15:01:25] (CR) Milimetric: [C: +2] refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400) (owner: Arturo Borrero Gonzalez)
[15:07:33] (Merged) jenkins-bot: refinery-core: iputils: refresh cloud addresses [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657329 (https://phabricator.wikimedia.org/T272400) (owner: Arturo Borrero Gonzalez)
[15:18:45] heya ottomata :] I'm here
[15:19:03] heya, wanted to sync on the status of the migration stuff; also I missed a ping of yours yesterday
[15:19:28] yes, sorry I didn't follow up yesterday
[15:19:32] let's bc?
[15:20:24] 2 mins
[15:20:26] k
[15:39:03] Analytics, Analytics-EventLogging, Event-Platform, Performance-Team: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (Ottomata)
[15:39:09] Analytics-Data-Quality, VisualEditor, WMDE-TechWish: Investigate missing dialog close events - https://phabricator.wikimedia.org/T272020 (awight)
[15:41:14] Analytics, Analytics-EventLogging, Event-Platform, Performance-Team: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (Ottomata) FYI we plan to migrate these schemas during the week of January 25 - 29.
[15:51:54] joal (for later): I noticed refinery-source now has /wmf/tmp/druid and refinery is doing /tmp_data_transfer/*. I will update the refinery change to /wmf/tmp/{druid,analytics} and commit/deploy/restart
[15:52:33] I'll also follow up on Marcel's comments
[15:59:41] Analytics, Event-Platform, Language-analytics, MW-1.36-notes (1.36.0-wmf.27; 2021-01-19): UniversalLanguageSelector Event Platform Migration - https://phabricator.wikimedia.org/T267352 (Ottomata)
[16:05:48] ottomata: gave a +2 to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/QuickSurveys/+/657355
[16:10:50] thanks mforns
[16:26:41] Heya
[16:26:47] milimetric: Hi!
[16:27:04] o/
[16:27:41] milimetric: I hope you've not started patching refinery - I provided a v2 version of that with the correct patch :S
[16:29:07] ottomata: es-internal deployed in staging!
[16:29:18] milimetric: I have updated my 4 patches (2 puppet, 1 refinery, 1 refinery-source)
[16:29:31] sorry if it proceeds super slowly, but I'm a n00b and I'm updating the docs as I go
[16:29:42] oh ok, I missed that, it's ok, I'll merge and start deploying then?
[16:31:27] milimetric: I still have to patch puppet again to change the event-sanitized job's refinery-source version, and then it'll be oozie restarts
[16:31:48] milimetric: let's do that after standup, so that we can concentrate?
[16:31:51] elukey: OhOoboy!
[16:31:53] joal: ok, let me do the oozie restarts though
[16:32:17] milimetric: we can organize, you taking some of it will be great, yes :) Thanks :)
[16:32:42] milimetric: there are also FS operations, to make perms coherent
[16:33:05] yep
[16:33:40] order should be deploy / kill jobs / change perms / merge puppet / start jobs
[16:34:51] milimetric: I'd actually go for: merge, kill-restart gently, possibly with some backfilling to triple check newly generated data, and finally change perms on already existing data
[16:35:14] milimetric: and actually, puppet changes first (hdfs folders creation)
[16:37:06] (CR) Milimetric: [C: +2] Change DataFrameToDruid base temporary path [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:39:31] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata)
[16:39:50] joal: for the HDFS paths - should we also make sure that puppet creates /wmf/tmp, or is it handled automatically? I didn't check the puppet code we have
[16:40:09] elukey: puppet code does mkdir -p so we should be good
[16:41:40] perfect
[16:41:58] Analytics, Data-Persistence-Backup: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (razzi) @jcrespo It looks like this is normal - traffic to wikimediafoundation.org has spiked since the 20th birthday last week, so the access logs...
[16:43:15] joal: I found something small, related to sqoop. Will send a follow-up patch to your refinery change and one additional puppet change
[16:44:22] ack milimetric thanks!
[16:44:34] Analytics, Data-Persistence-Backup: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (jcrespo) Open→Resolved a: razzi Cool thanks. I initially filed this because I had misinterpreted the data as the data shrinking (not g...
[16:47:42] finally joal, I noticed you didn't change this one - is it special for some reason or should I change it too? https://gerrit.wikimedia.org/g/analytics/refinery/+/e7916337e11352f6d79e9fe0b9d0f268ff9f7577/oozie/article_recommender/coordinator.properties#71
[16:48:18] milimetric: this code is dead - it never ran and nobody maintains it - that's why I didn't change it - we can do it if you wish, but I'd rather delete it
[16:48:32] joal: ok, deleting the code then
[16:48:50] Thanks for that milimetric - it's probably worth its own patch
[16:49:03] will do in a follow-up
[16:49:09] milimetric: this code belongs to research, we should follow up with them before acting (maybe?)
[16:49:42] it's easy enough to undo, better to ask forgiveness
[16:50:00] ack milimetric - go for it
[16:51:40] note for later: I want to follow up on the Big Top migration
[16:52:47] (PS3) Milimetric: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:55:12] (PS1) Milimetric: Delete article recommender job [analytics/refinery] - https://gerrit.wikimedia.org/r/657369
[16:55:18] elukey@an-launcher1002:~$ ls -ld /mnt/hdfs/wmf/tmp/
[16:55:18] drwxr-x--- 2 hdfs hadoop 4096 Jan 20 16:53 /mnt/hdfs/wmf/tmp/
[16:55:20] (CR) Joal: Update oozie jobs tmp folders for ownership/perms (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:55:20] joal: --^
[16:55:31] tltaylor: \o/
[16:55:50] tltaylor: sorry for the ping
[16:55:58] no worries
[16:56:02] joal: we need to explicitly create the dir with different perms, otherwise the subdirs will not be accessible
[16:56:29] makes sense elukey
[16:57:04] (PS4) Milimetric: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:57:39] elukey: we wish the /wmf/tmp folder to be o+rx, right?
[16:58:01] (CR) Milimetric: [V: +2 C: +2] Update oozie jobs tmp folders for ownership/perms (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[16:58:21] (PS2) Milimetric: Delete article recommender job [analytics/refinery] - https://gerrit.wikimedia.org/r/657369
[16:58:24] joal: yes, or maybe something like analytics:analytics, so the analytics user will be able to access it
[16:58:29] (CR) Milimetric: [V: +2 C: +2] Delete article recommender job [analytics/refinery] - https://gerrit.wikimedia.org/r/657369 (owner: Milimetric)
[16:58:39] joal: ah no, druid also needs it
[16:58:41] actually elukey, druid should also be able
[16:58:42] so yes, other
[16:58:43] right
[16:58:45] yes yes sorry
[16:58:46] ok
[16:58:50] Updating my patch
[16:58:57] actually, sending a new patch
[16:59:09] joal: can you also add an explicit dependency with require?
[16:59:21] in the existing dir creations, something like
[16:59:37] require => Cdh::etc..
[16:59:54] so /wmf/tmp will be created first etc..
[17:00:03] Will try :)
[17:00:06] lemme give you the exact require, one sec
[17:02:13] require => Cdh::Hadoop::Directory['/wmf/tmp']
[17:02:20] it should work fine
[17:02:29] puppet does it for File resources but this is different
[17:02:35] *does it automatically
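The scheme being settled on here - a world-traversable /wmf/tmp parent with per-user subdirectories - translates into a handful of HDFS operations. A hedged sketch of the manual equivalent, assuming the hdfs superuser and the analytics/druid ownership split discussed above (the druid group name is an assumption):

    # Parent needs o+rx so the analytics and druid users can traverse into their subdirs.
    sudo -u hdfs hdfs dfs -mkdir -p /wmf/tmp
    sudo -u hdfs hdfs dfs -chmod 755 /wmf/tmp

    # Per-user scratch areas, each restricted to its owner.
    sudo -u hdfs hdfs dfs -mkdir -p /wmf/tmp/analytics /wmf/tmp/druid
    sudo -u hdfs hdfs dfs -chown analytics:analytics /wmf/tmp/analytics
    sudo -u hdfs hdfs dfs -chown druid:druid /wmf/tmp/druid
    sudo -u hdfs hdfs dfs -chmod 750 /wmf/tmp/analytics /wmf/tmp/druid

In production this is what the Cdh::Hadoop::Directory puppet resources encode; the explicit require mentioned above just guarantees /wmf/tmp is created before its children.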
[17:02:58] mforns: FYI I updated https://phabricator.wikimedia.org/T259163 with instructions on how to test server-side events using the PHP repl
[17:03:10] mwscript shell.php --wiki testwiki
[17:03:14] ottomata: ok thanks!
[17:04:32] fdans: standup!
[17:23:48] joal: when you have time we can discuss the questions on the session length spreadsheet
[17:23:52] a-team: We're going for a deploy of refin
[17:24:07] ok
[17:24:30] ery-source first,
[17:29:44] Analytics, Analytics-Kanban, Event-Platform, EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) Got up to deploying the service in staging, it seems to be working! Upd
[17:37:52] ottomata, mforns - anything more than the patches for perms and refine-sanitize to be deployed for you folks?
[17:38:07] joal: nothing on my side
[17:38:12] ack - thanks mforns
[17:38:45] mforns: Will you have some time later to talk about sampling?
[17:38:55] joal: of course!
[17:38:59] cool :)
[17:39:08] joal: I have a meeting at 20h, but before or after is OK
[17:39:14] ack!
[17:47:00] ottomata: ping again --^
[17:50:01] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) With @ottomata we came up with a way to roll back a partition migration. When applying a migration, it prints the current state, which can be used to migr...
[17:50:14] ok nothing for Andrew
[17:50:49] Therefore: merging patches in refinery-source
[17:51:54] (CR) Joal: [C: +2] "Merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[17:53:03] (CR) Joal: [C: +2] "Merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657171 (https://phabricator.wikimedia.org/T272177) (owner: Joal)
[17:54:14] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (Ottomata) FYI, the controller bounce idea we got from https://users.kafka.apache.narkive.com/epBsWAPC/stuck-re-balance
[17:54:26] milimetric: I updated the deployment etherpad - https://etherpad.wikimedia.org/p/analytics-weekly-train
[17:59:04] (Merged) jenkins-bot: Fix DataFrameExtension.convertToSchema repartition [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657171 (https://phabricator.wikimedia.org/T272177) (owner: Joal)
[18:02:09] milimetric: We need a new patch for refinery before deploy - it's coming
[18:02:48] joal: ok, I'm in an interview for an hour anyway so no rush
[18:02:57] ack milimetric
[18:08:17] (PS1) Joal: Update geoeditors monthly jar version for wmcs IPs [analytics/refinery] - https://gerrit.wikimedia.org/r/657383 (https://phabricator.wikimedia.org/T272400)
[18:09:24] milimetric: that patch --^ requires us to deploy refinery-source before refinery
[18:09:53] Ok, deploying refinery-source
[18:11:32] !log Release refinery-source v0.0.144 to archiva with Jenkins
[18:11:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:37] (PS1) Joal: Bump changelog.md for v0.0.144 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657384
[18:14:21] (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/657384 (owner: Joal)
[18:15:24] Starting build #67 for job analytics-refinery-maven-release-docker
[18:28:19] Project analytics-refinery-maven-release-docker build #67: SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/67/
[18:35:48] Starting build #34 for job analytics-refinery-update-jars-docker
[18:36:12] (PS1) Maven-release-user: Add refinery-source jars for v0.0.144 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/657387
[18:36:12] Project analytics-refinery-update-jars-docker build #34: SUCCESS in 24 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/34/
[18:36:21] hey mforns - do we take a few minutes now?
[18:36:29] joal: yes!
[18:36:32] bc?
[18:36:34] mforns: tardis!
[18:36:38] k
[18:43:33] joal: qq - we decided to go with replication factor 2 for the backup cluster, right?
[18:43:50] we did indeed elukey
[18:44:30] ack perfect
[18:44:38] the config looks good then, tomorrow I'll merge
[18:44:51] awesome elukey
[18:45:41] razzi, ottomata - going afk if I am not needed
[18:46:30] elukey: sounds good, this migration is on track to finish in 20 minutes and we'll be working on the plan for the rest
[18:46:59] razzi: perfect, nice job :)
[18:48:34] * elukey afk!
[19:02:11] (CR) Joal: [V: +2 C: +2] "Merging for later deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/657387 (owner: Maven-release-user)
[19:14:11] Analytics-Radar, SRE, ops-eqiad, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (Cmjohnson) Open→Resolved a: Cmjohnson @elukey I swapped the SSD. The only spare I had is 300GB. It's new. Feel free to do what you need. I am resolving this t...
[19:14:59] ok milimetric - refinery is ready to be deployed, I think
[19:15:56] milimetric: In addition to the restarts for perms, there is geoeditors-monthly to restart, to pick up the updated hive jar with the WMCS IP changes (documented on the train etherpad)
[19:23:16] Pchelolo: milimetric FYI: https://phabricator.wikimedia.org/T120242#6763163
[19:23:17] :)
[19:23:28] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata) FYI, I had a chat with @Krinkle yesterday, and he informed me that for all MediaWiki browser client generated writes to MediaWiki...
[19:25:59] ottomata: are you still in ops with razzi or can I take your time for a minute?
[19:26:33] joal: we split as the migration continued; now it's finished :)
[19:26:57] Thanks for the update razzi :)
[20:02:16] Analytics, Data-release, Privacy Engineering, Research, Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (Nuria) Parking some thoughts from my conversation with @Isaac after his good work this pa...
[20:21:14] Gone for tonight team - see y'all tomorrow
[20:39:08] joal AHHH
[20:39:12] I missed your ping!!!
[20:39:14] sorry!
[20:39:24] (nice razzi!)
[20:49:21] Analytics, Data-release, Privacy Engineering, Research, Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (TedTed) @Isaac, that's an amazing demo! I love this, thanks for your work and thoughtful...
[21:04:39] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) Migrated the following topics on kafka-jumbo: codfw.mediawiki.revision-create eqiad.mediawiki.revision-create The migrations still to be run are on kafk
[21:18:31] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi) One more useful command: to change the throttle rate, run it on the node the data is coming from and the node the data is going to. For example, if data is b
[21:19:17] (PS1) Milimetric: Update country blacklist based on latest reports [analytics/refinery] - https://gerrit.wikimedia.org/r/657413
[21:20:26] (CR) Milimetric: [V: +2 C: +2] "144 deployed, merging this" [analytics/refinery] - https://gerrit.wikimedia.org/r/657383 (https://phabricator.wikimedia.org/T272400) (owner: Joal)
[21:21:07] (CR) Milimetric: [V: +2 C: +2] Update country blacklist based on latest reports [analytics/refinery] - https://gerrit.wikimedia.org/r/657413 (owner: Milimetric)
[21:48:58] !log refinery deployed, synced to hdfs, ready to restart 53 oozie jobs, will do so slowly over the next few hours
[21:49:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:33:34] milimetric: the couple jobs that just fail, is this you?
[22:33:45] failed *
[23:45:37] fdans: I'm looking into it, it's probably me, yes, I've restarted like 40 jobs so far
[23:47:03] fdans: I don't see any failures, just SLA alerts (those fire when jobs are restarted, I sent an email ahead of all the spam)
[23:47:20] milimetric: yea my bad, they're totally sla alerts
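The restart procedure behind that !log entry (kill each running coordinator, then resubmit it from the freshly deployed refinery) maps onto the standard oozie CLI. A hedged sketch with a placeholder coordinator ID, properties path, and start time - the real jobs take more -D overrides:

    # Kill the running coordinator (ID is a placeholder).
    oozie job -oozie $OOZIE_URL -kill 0012345-210101000000000-oozie-oozi-C

    # Resubmit from the deployed refinery, starting where the old run left off
    # (properties file path and start_time are illustrative).
    oozie job -oozie $OOZIE_URL -run \
      -config /srv/deployment/analytics/refinery/oozie/pageview/hourly/coordinator.properties \
      -Dstart_time=2021-01-21T00:00Z

Restarting the 53 coordinators in small batches, as described above, keeps the SLA-alert spam and cluster load manageable; as the last exchange notes, those alerts fire on restart even when nothing is actually failing.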