[02:17:59] (CR) Gergő Tisza: [C: -1] "Commit subject must be followed by an empty line." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[02:18:13] (PS5) Gergő Tisza: Migrate MobileWikiAppDailyStats to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/722964 (https://phabricator.wikimedia.org/T286000) (owner: Sharvaniharan)
[04:30:18] Analytics, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (Marostegui) p:Triage→Medium
[05:27:30] Analytics, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (MMiller_WMF) I just re-organized the description to clarify actionable analytics questions. We will figure out this week how to re...
[05:27:38] Analytics, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (MMiller_WMF)
[07:12:42] (CR) Joal: Fix hdfs-cleaner script using shaded jar (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[07:56:51] (CR) Filippo Giunchedi: [C: +1] "LGTM" [analytics/statsv] - https://gerrit.wikimedia.org/r/721044 (https://phabricator.wikimedia.org/T290131) (owner: Dave Pifke)
[08:11:31] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Repair 3 of 4 has completed successfully. ` [2021-09-26 12:43:22,108] Repair session 557db261-1d46-11ec-831d-519c7747ad64 for range (-3038345190283360344,-30295...
[08:12:36] Analytics, Analytics-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Final repair for this operation is under way. ` btullis@aqs1007:~$ sudo nodetool-b repair --full local_group_default_T_mediarequest_per_file data [2021-09-27 08...
[08:17:21] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) Reload of all 4 snapshots of these repaired tables is under way.
[08:20:01] Analytics: Add wikitech (labswiki) to the sqoop list - https://phabricator.wikimedia.org/T217792 (Marostegui)
[09:36:34] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) Thanks for your feedback @urbanecm - I'm not 100% clear on the history of stat1005 but it looks like the...
[09:43:53] automatic kerberos token renewal *_* THANKS!
[10:03:04] tanny411: You're very welcome :-)
[10:04:54] * urbanecm is looking for that working in stat1005 too 🙂
[10:15:04] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) I've researched the history of stat1005 as much as I think I need to in order to make a decision. It loo...
[10:16:47] urbanecm: Just waiting for a second pair of eyes from the team on this: https://phabricator.wikimedia.org/T268985#7379800 then it should be good to go.
[10:17:47] great!
[10:27:22] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (MoritzMuehlenhoff) Looking at /var/log/installer it seems stat1005 was installed in 2019 with Stretch and then la...
[11:07:54] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis) The load of all of the repaired tables from the 4 instances on aqs1004 and aqs1007 is complete. @JAllemandou - Is there a testing process that you would like...
[11:08:05] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (BTullis)
[11:11:42] !log btullis@stat1005:~$ sudo apt install usrmerge
[11:11:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:12:27] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (JAllemandou) a:BTullis→JAllemandou Thanks @BTullis :) Assigning the task to myself for the last round of checks, then I'll close it.
[11:12:29] Analytics, Analytics-Kanban: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (JAllemandou)
[11:15:36] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) Installing this now. The following debconf question was displayed. {F34659171} Answered yes. One unexpe...
[11:17:45] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) The `molly-guard` package was reinstalled automatically during the next puppet agent run. ` Info: Applyi...
[11:18:15] !log btullis@stat1005:~$ sudo apt purge usrmerge
[11:18:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:19:29] urbanecm: kerberos auto-renewal on stat1005 should now work.
[11:20:56] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) Ticket auto-renewal now works for me: ` You have a valid Kerberos ticket. Creating automatic Kerberos...
[12:23:48] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 3.476 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[12:24:27] btullis: let me test!
[12:24:52] btullis: works perfectly! thanks
[12:29:45] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.6082 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[12:55:51] Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Ottomata) > If I can do beeline in stat1005 and look at the data This would be possible, but you'd have to e...
[12:56:20] (CR) Kosta Harlan: [C: +2] Add a link: Update action_data for back, next actions to account for navigation type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723601 (https://phabricator.wikimedia.org/T290316) (owner: MewOphaswongse)
[12:56:57] (Merged) jenkins-bot: Add a link: Update action_data for back, next actions to account for navigation type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/723601 (https://phabricator.wikimedia.org/T290316) (owner: MewOphaswongse)
[12:58:32] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (Ottomata) Nice!
[13:00:10] (CR) Ottomata: "removing old versions is something we have to worry about for all our other jobs too; it's probably better to treat this one exactly the sa" [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[13:19:57] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:22:59] (CR) Joal: "You're right ottomata :) Updating patch now." [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[13:24:30] (PS2) Joal: Add jar-version in hdfs-cleaner script [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967)
[13:25:42] (CR) Ottomata: [C: +2] Add jar-version in hdfs-cleaner script [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[13:25:46] (CR) Ottomata: [V: +2 C: +2] Add jar-version in hdfs-cleaner script [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[13:26:02] joal: should i do a quick deploy of that to an-launcher?
[13:26:38] ottomata: We'll deploy with btullis tomorrow, it was part of the plan :)
[13:26:46] thank you for offering ottomata :)
[13:26:57] ottomata: IIRC btullis downtimed alerts on purpose
[13:27:01] ok great
[13:27:01] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) I think we should just proceed with eventgate, I'll do staging in each first. Will have to delete in staging...
[13:29:25] I downtimed the individual systemd units for these, but I didn't downtime the aggregated alert in case we got a failure from another systemd unit.
[13:29:28] https://usercontent.irccloud-cdn.com/file/CkA6xJTZ/image.png
[13:29:56] (PS1) GoranSMilovanovic: T286242 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/724063
[13:30:16] (CR) GoranSMilovanovic: [V: +2 C: +2] T286242 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/724063 (owner: GoranSMilovanovic)
[13:30:57] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:31:00] https://usercontent.irccloud-cdn.com/file/TNH2IDhR/image.png
[13:32:17] +1
[13:35:25] Analytics, Analytics-Kanban: Snapshot and Reload cassandra2 pageview_per_file data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis)
[13:39:42] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Plan for staging: ` helmfile -e staging destroy # wait and make sure all is gone. helmfile -e staging apply `...
[13:43:08] (CR) Btullis: Add jar-version in hdfs-cleaner script (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[13:43:24] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Joe) Don't forget to wait for the DNS TTL and/or lower the TTL before every depool/repool operation. so you might want...
[13:46:57] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Thanks, added this step into my comment above.
[13:54:52] (PS1) Joal: Fix typo in hdfs-cleaner previous patch [analytics/refinery] - https://gerrit.wikimedia.org/r/724066
[13:55:32] (CR) Joal: Add jar-version in hdfs-cleaner script (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/723516 (https://phabricator.wikimedia.org/T217967) (owner: Joal)
[13:55:55] (CR) Joal: [V: +2 C: +2] "Merging hotfix" [analytics/refinery] - https://gerrit.wikimedia.org/r/724066 (owner: Joal)
[14:27:57] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:52:51] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Milimetric) I agree we should drop them. I vaguely remember this kind of special access to some versions of the old reportcard, and even back then use was v...
[14:58:28] Analytics, Product-Analytics, Editing-team (Tracking): Add MariaDB replicas to Superset - https://phabricator.wikimedia.org/T291195 (Milimetric) >>! In T291195#7363033, @mpopov wrote: > @elukey: Megan's most pressing use case is [[ https://www.mediawiki.org/wiki/Extension:DiscussionTools/discussionto...
[15:07:57] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson)
[15:08:02] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson) all firmware updated
[15:45:13] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:58:21] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:08:27] Analytics, Event-Platform, serviceops: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Ah, there were some mistakes in our patches: the tls Service wasn't using the same label selectors that the po...
[16:09:56] Analytics, Data-Engineering: [Session length] Apply different sample rates per wiki - https://phabricator.wikimedia.org/T291693 (odimitrijevic) p:Triage→High
[16:11:41] Analytics-Clusters, Data-Engineering: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664 (odimitrijevic)
[16:12:50] Analytics, Data-Engineering, Platform Engineering, Wikibase change dispatching scripts to jobs, serviceops: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (odimitrijevic)
[16:14:50] Analytics, Data-Engineering: Analytics-hadoop Spark3 package upgrade (production) - https://phabricator.wikimedia.org/T291466 (odimitrijevic) p:Triage→Medium
[16:16:10] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_file data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (odimitrijevic)
[16:16:48] Analytics, Analytics-Kanban, Data-Engineering: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (odimitrijevic)
[16:17:29] Analytics, Analytics-Kanban, Data-Engineering: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (odimitrijevic)
[16:18:39] Analytics, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (wiki_willy) a:Cmjohnson
[16:20:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:21:10] Analytics-Radar, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (odimitrijevic)
[16:22:02] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (odimitrijevic)
[17:12:44] (CR) Michael DiPietro: [C: +2] add stop status [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:16:50] (Merged) jenkins-bot: add stop status [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[17:27:46] Analytics, Data-Engineering, Event-Platform, serviceops: Enable envoy tls proxy logging from eventgate - https://phabricator.wikimedia.org/T291856 (Ottomata)
[17:35:56] ottomata: heya - batcave for 1/2h?
[17:39:05] joal: ok!
[17:39:23] in bc
[18:03:10] Analytics-Radar, Product-Analytics: Do the messages left for unregistered or logged-out IP editors get read by those editors? - https://phabricator.wikimedia.org/T291297 (nettrom_WMF) Update: we'll triage this tomorrow, Sept 28.
[18:09:56] Analytics, Data-Engineering, Growth-Team, Metrics-Platform, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (nettrom_WMF) Looks like this got deployed last week with the train? I'm not seeing any changes in t...
[18:20:22] Analytics, Data-Engineering, Growth-Team, Metrics-Platform, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) We need to enable it per eventgate service. Patch OTW...
[18:28:51] ottomata: heya - are you still there?
[18:31:10] joal yuppers
[18:31:12] wassuuuPp
[18:31:16] bc?
[18:31:23] ottomata: quick idea about events task - cave?
[18:31:25] ya
[18:33:07] Analytics-Radar, Product-Analytics, Wikipedia-Android-App-Backlog (Android Release FY2021-22): What percentage of app editors are IP editors? - https://phabricator.wikimedia.org/T291866 (JTannerWMF)
[18:52:50] Analytics, Analytics-Kanban, Data-Engineering, Growth-Team, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata)
[18:56:18] Analytics-Radar, Product-Analytics, Wikipedia-Android-App-Backlog (Android Release FY2021-22): What percentage of app editors are IP editors? - https://phabricator.wikimedia.org/T291866 (JTannerWMF) p:Triage→Medium
[18:56:48] Analytics, Performance-Team: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (Krinkle) a:Krinkle
[19:04:38] Analytics, Analytics-Kanban, Data-Engineering, Growth-Team, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) There is a bug in the code, so I had to revert my config patch. Hope to follow up this w...
[19:20:57] Amir1: we don't have mw job events in hadoop anymore (we could put them back)
[19:21:08] but...you can consume them directly from kafka if that is helpful
[19:30:09] Analytics-Radar, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (JMinor)
[19:31:34] Analytics-Radar, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (JMinor)
[19:55:09] (PS1) Andrew Bogott: Added test_results.py [analytics/quarry/web] - https://gerrit.wikimedia.org/r/724169
[19:56:51] (CR) jerkins-bot: [V: -1] Added test_results.py [analytics/quarry/web] - https://gerrit.wikimedia.org/r/724169 (owner: Andrew Bogott)
[19:59:58] (CR) Andrew Bogott: "recheck" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/724169 (owner: Andrew Bogott)
[20:04:34] Analytics, Analytics-Kanban, Data-Engineering, Growth-Team, and 5 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (nettrom_WMF) >>! In T288853#7381469, @Ottomata wrote: > There is a bug in the code, so I had to rev...
[21:14:01] Analytics-Radar, Wikipedia-iOS-App-Backlog, Product-Analytics (Kanban), User-Johan: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (Isaac) > I would suspect that these IPs wouldn't be producing many (any?) automated actions, as inherently any requests from...
[21:14:55] Analytics, Analytics-Kanban, Data-Engineering, Event-Platform, serviceops: Enable envoy tls proxy logging from eventgate - https://phabricator.wikimedia.org/T291856 (Ottomata)
[22:02:33] Analytics-Radar, Dumps-Generation, Machine-Learning-Team, ORES, and 5 others: [Epic] Make ORES scores for wikidata available as a dump - https://phabricator.wikimedia.org/T209611 (So9q)