[00:04:56] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker2094:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075334 (owner: TrainBranchBot)
[00:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 810.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:13:43] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: Scott French)
[01:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 810.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:01:03] (PS1) RLazarus: deployment_server: More mwscript-k8s usability tweaks [puppet] - https://gerrit.wikimedia.org/r/1075346
[02:13:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:56] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker2094:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:56] FIRING: SystemdUnitFailed: build-homepage.service on registry2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:57:32] SRE, Data-Engineering, Observability-Logging, Wikimedia-Logstash, Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645#10174094 (Cpetrillo) This would be very useful for us to be able to understand if known problematic reusers (see: https://phabric...
[03:59:24] (CR) KartikMistry: [C:+2] Update MinT to 2024-09-19-120927-production [deployment-charts] - https://gerrit.wikimedia.org/r/1074187 (owner: KartikMistry)
[03:59:38] ^ Deploying MinT
[04:00:27] (Merged) jenkins-bot: Update MinT to 2024-09-19-120927-production [deployment-charts] - https://gerrit.wikimedia.org/r/1074187 (owner: KartikMistry)
[04:01:25] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[04:04:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:57] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[04:09:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:05] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[04:14:56] RESOLVED: [2x] SystemdUnitFailed: build-homepage.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:20:24] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[04:23:45] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[04:32:52] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[04:38:48] ૅ!log Updated MinT to 2024-09-19-120927-production
[04:39:14] !log Updated MinT to 2024-09-19-120927-production
[04:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:32:26] ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10174130 (ABran-WMF) a:VRiley-WMF→ABran-WMF great @Jclark-ctr thanks! will take it over from here :-)
[05:38:19] SRE, DBA: Parsercache primary master databases should monitor replication - https://phabricator.wikimedia.org/T375395#10174139 (ABran-WMF) Open→Resolved a:ABran-WMF @jcrespo I'll close this ticket as described in T373037#10174135, both task have been tied together.
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T0600)
[06:05:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:47] ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10174150 (ABran-WMF) p:High→Medium
[06:41:54] (PS1) Muehlenhoff: Remove LDAP access for rudolphampofo [puppet] - https://gerrit.wikimedia.org/r/1075356
[06:48:01] (PS1) Slyngshede: Minor UI tweaks. [software/bitu] - https://gerrit.wikimedia.org/r/1075432
[06:50:49] (PS1) Slyngshede: P:idm Gitlab API is https. [puppet] - https://gerrit.wikimedia.org/r/1075433
[06:51:40] !log installing gnutls security updates on bullseye/bookworm
[06:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:59] (CR) Alexandros Kosiaris: [C:+1] sre.discovery.datacenter: exclude kibana7 [cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: Scott French)
[06:53:48] (PS2) Slyngshede: Minor UI tweaks, fix Gerrit blocking bug. [software/bitu] - https://gerrit.wikimedia.org/r/1075432
[06:54:27] (CR) Slyngshede: [C:+2] P:idm Gitlab API is https. [puppet] - https://gerrit.wikimedia.org/r/1075433 (owner: Slyngshede)
[07:00:05] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T0700). Please do the needful.
[07:00:06] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:00] I can deploy abijeet's patch.
[07:01:11] hello
[07:01:35] abijeet: hola. I'll ping once patch is to test on mwdebug servers.
[07:01:39] SRE-tools, Infrastructure-Foundations: SREBatchBase: Print allowed aliases in help message - https://phabricator.wikimedia.org/T375590 (MoritzMuehlenhoff) NEW
[07:01:42] (CR) Brouberol: [C:+1] "LG!" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[07:01:46] SRE-tools, Infrastructure-Foundations: SREBatchBase: Print allowed aliases in help message - https://phabricator.wikimedia.org/T375590#10174190 (MoritzMuehlenhoff) p:Triage→Medium
[07:01:47] kart_, thanks!
[07:02:19] (CR) Brouberol: [C:+1] "Perfect" [deployment-charts] - https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[07:02:23] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector
[07:02:51] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:03:37] (Merged) jenkins-bot: Enable message group subscription feature for Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:04:02] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1075432 (owner: Slyngshede)
[07:04:20] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1072166|Enable message group subscription feature for Test Wikipedia (T372386)]]
[07:04:25] (CR) Slyngshede: [C:+2] Minor UI tweaks, fix Gerrit blocking bug. [software/bitu] - https://gerrit.wikimedia.org/r/1075432 (owner: Slyngshede)
[07:04:27] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[07:06:35] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1072166|Enable message group subscription feature for Test Wikipedia (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:07:27] (CR) Jelto: [C:+1] "lgtm. When this behaves as expected in WMCS we can also remove this config for the production gitlab-runners" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[07:08:57] abijeet: available on mwdebug for testing.
[07:09:12] kart_, checking
[07:11:35] (CR) Jelto: [C:+2] Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[07:17:45] abijeet: all OK?
[07:18:46] kart_, not seeing the button to subscribe appear.
[07:19:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector
[07:19:57] oh!
[07:24:01] (PS1) Slyngshede: Undo async setting for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1075499
[07:25:07] abijeet: I see error: messagegroupsubscription: Failed to fetch user subscriptions internal_api_error_DBQueryError {action: 'query', list: 'messagegroupsubscription', formatversion: 2}
[07:26:55] kart_, the configuration change seems to have been deployed fine. I'm doing some debugging...will ping you in a bit
[07:27:12] mw.config.get( 'wgTranslateEnableMessageGroupSubscription' ) returns true
[07:27:12] (CR) Slyngshede: [C:+2] Undo async setting for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1075499 (owner: Slyngshede)
[07:27:44] Yes
[07:29:56] (Merged) jenkins-bot: Undo async setting for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1075499 (owner: Slyngshede)
[07:31:21] kart_, I see a database query error in the console. Lets revert the change
[07:32:22] OK!
[07:32:52] !log kartik@deploy1003 Sync cancelled.
[07:32:59] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw
[07:33:39] (CR) Muehlenhoff: [C:+2] Remove LDAP access for rudolphampofo [puppet] - https://gerrit.wikimedia.org/r/1075356 (owner: Muehlenhoff)
[07:33:46] (PS1) TrainBranchBot: Revert "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075501
[07:33:46] (CR) TrainBranchBot: "kartik@deploy1003 created a revert of this change as I6f78b7a102ae9f6507e54866b7824fa82eafad5b" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:34:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw
[07:34:16] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075501 (owner: TrainBranchBot)
[07:34:32] kart_, thanks!
[07:35:10] (Merged) jenkins-bot: Revert "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075501 (owner: TrainBranchBot)
[07:35:29] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075501|Revert "Enable message group subscription feature for Test Wikipedia"]]
[07:36:40] (CR) Vgutierrez: [C:-1] "this should be split by service and the CDN shouldn't be a part of it since we are doing a progressive deprecation there." [puppet] - https://gerrit.wikimedia.org/r/1075326 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[07:36:44] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad
[07:37:26] !log kartik@deploy1003 kartik, trainbranchbot: Backport for [[gerrit:1075501|Revert "Enable message group subscription feature for Test Wikipedia"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:37:36] !log kartik@deploy1003 kartik, trainbranchbot: Continuing with sync
[07:37:40] !log restarting slapd on r/w LDAP servers to pick up GNUTLS security updates
[07:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad
[07:39:01] Interesting, if we deploy revert changes only deployed to mwdebug servers with scap backport --revert, do we need to do full deployment? :)
[07:40:43] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10174244 (MoritzMuehlenhoff)
[07:42:18] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075501|Revert "Enable message group subscription feature for Test Wikipedia"]] (duration: 06m 48s)
[07:43:24] (PS1) Elukey: docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502
[07:44:27] (CR) Ayounsi: [C:+2] Allow prometheus hosts to reach gnmi port [homer/public] - https://gerrit.wikimedia.org/r/1075237 (owner: Ayounsi)
[07:44:32] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4116/co" [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[07:44:59] (Merged) jenkins-bot: Allow prometheus hosts to reach gnmi port [homer/public] - https://gerrit.wikimedia.org/r/1075237 (owner: Ayounsi)
[07:45:02] (CR) CI reject: [V:-1] docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[07:45:04] (PS1) Slyngshede: Bump Debian package version to 0.0.9. [software/bitu] - https://gerrit.wikimedia.org/r/1075503
[07:45:05] (CR) DCausse: [C:+2] rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) (owner: DCausse)
[07:46:06] (PS2) Elukey: docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502
[07:46:08] (Merged) jenkins-bot: rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) (owner: DCausse)
[07:47:23] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[07:47:57] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:48:06] (CR) Hashar: [C:+1] "I have a simple change I can deploy to validate everything works fine: https://gerrit.wikimedia.org/r/c/integration/docroot/+/1071197" [puppet] - https://gerrit.wikimedia.org/r/1072744 (owner: Muehlenhoff)
[07:51:21] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[07:51:50] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:52:03] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1075503 (owner: Slyngshede)
[07:52:18] (PS1) Jelto: gitlab: test defs_from_etcd on the replica [puppet] - https://gerrit.wikimedia.org/r/1075504 (https://phabricator.wikimedia.org/T366882)
[07:54:20] (CR) Muehlenhoff: [C:+2] deployment servers: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1072744 (owner: Muehlenhoff)
[07:54:41] (CR) Slyngshede: [C:+2] Bump Debian package version to 0.0.9. [software/bitu] - https://gerrit.wikimedia.org/r/1075503 (owner: Slyngshede)
[07:54:54] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1075504 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[07:55:40] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[07:55:48] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:57:04] (Merged) jenkins-bot: Bump Debian package version to 0.0.9. [software/bitu] - https://gerrit.wikimedia.org/r/1075503 (owner: Slyngshede)
[07:58:10] !log running REPLACE into dtpwiki db2123 (s5) T375507
[07:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:53] (CR) JMeybohm: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[07:59:56] (CR) Elukey: "Tested live on registry1005, and it works:" [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[08:00:04] (CR) Elukey: [C:+2] docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[08:00:05] brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T0800).
[08:04:56] FIRING: SystemdUnitFailed: build-homepage.service on registry1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:19] !log hashar@deploy1003 Started deploy [integration/docroot@0482d53]: zuul: show change queues window value
[08:06:26] !log hashar@deploy1003 Finished deploy [integration/docroot@0482d53]: zuul: show change queues window value (duration: 00m 07s)
[08:06:31] !log set max-catalog-entries (changes the default catalog pagination) to 50 for docker-registry - T348876
[08:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:37] T348876: Container image reports in debmonitor are broken - https://phabricator.wikimedia.org/T348876
[08:07:14] SRE, Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10174363 (MoritzMuehlenhoff)
[08:07:24] SRE, Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10174367 (MoritzMuehlenhoff)
[08:09:23] (PS1) Muehlenhoff: scap_proxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075507
[08:09:51] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075507 (owner: Muehlenhoff)
[08:12:33] (PS1) Brouberol: Deploy an airflow-scheduler ClusteRole to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389)
[08:14:52] (PS2) Santiago Faci: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473)
[08:14:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:17] (PS3) Santiago Faci: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473)
[08:27:34] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging DNdubane out of all services on: 1540 hosts
[08:28:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging DNdubane out of all services on: 1540 hosts
[08:28:09] (CR) Hashar: [C:+1] scap_proxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075507 (owner: Muehlenhoff)
[08:28:12] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging DNdubane out of all services on: 700 hosts
[08:28:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging DNdubane out of all services on: 700 hosts
[08:29:32] (PS1) Muehlenhoff: Remove access for dumisani [puppet] - https://gerrit.wikimedia.org/r/1075509
[08:32:15] (CR) Slyngshede: [C:+1] "Looks good. Needs to be removed from wmf ldap group as well." [puppet] - https://gerrit.wikimedia.org/r/1075509 (owner: Muehlenhoff)
[08:33:49] (CR) Muehlenhoff: [C:+2] Remove access for dumisani [puppet] - https://gerrit.wikimedia.org/r/1075509 (owner: Muehlenhoff)
[08:39:05] (PS1) Elukey: docker_registry_ha: reduce maxentries' default to 25 [puppet] - https://gerrit.wikimedia.org/r/1075510 (https://phabricator.wikimedia.org/T348876)
[08:42:42] (CR) FNegri: [C:+1] [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[08:47:59] (Abandoned) Muehlenhoff: Revert "No longer include config-master on Puppet 5 frontends" [puppet] - https://gerrit.wikimedia.org/r/1074994 (owner: Muehlenhoff)
[08:48:58] (Abandoned) Muehlenhoff: P:etcd::tlsproxy: move to cfssl pki [puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[08:49:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:03] (CR) Jcrespo: [C:+1] "Looks ok sql-wise, but probably someone more familiar with mediawiki (security or engineering) should give an ok to the final view once de" [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[08:50:54] (PS2) Brouberol: Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389)
[08:52:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:53:32] (CR) Arnaudb: [C:+1] hiera: specify cluster for apus nodes [puppet] - https://gerrit.wikimedia.org/r/1075027 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[08:54:56] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:56] (CR) CI reject: [V:-1] Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[08:55:44] (CR) Btullis: Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[08:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:59:09] (CR) Brouberol: Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[09:01:03] (CR) Elukey: [C:+2] docker_registry_ha: reduce maxentries' default to 25 [puppet] - https://gerrit.wikimedia.org/r/1075510 (https://phabricator.wikimedia.org/T348876) (owner: Elukey)
[09:01:56] (CR) MVernon: [C:+2] hiera: specify cluster for apus nodes [puppet] - https://gerrit.wikimedia.org/r/1075027 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:03:03] (PS1) Slyngshede: P:idm Add empty ACCESS_REQUEST_RULES to production. [puppet] - https://gerrit.wikimedia.org/r/1075512
[09:06:43] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075512 (owner: Slyngshede)
[09:06:58] (CR) Slyngshede: [C:+2] P:idm Add empty ACCESS_REQUEST_RULES to production. [puppet] - https://gerrit.wikimedia.org/r/1075512 (owner: Slyngshede)
[09:07:24] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=apus,name=codfw
[09:09:56] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:23] (PS1) Slyngshede: IDM: Switch to upgraded IDM host. [dns] - https://gerrit.wikimedia.org/r/1075513
[09:11:51] !log set max-catalog-entries (changes the default catalog pagination) to 25 for docker-registry - T348876
[09:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:58] T348876: Container image reports in debmonitor are broken - https://phabricator.wikimedia.org/T348876
[09:12:49] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1075513 (owner: Slyngshede)
[09:13:43] (CR) Slyngshede: [C:+2] IDM: Switch to upgraded IDM host. [dns] - https://gerrit.wikimedia.org/r/1075513 (owner: Slyngshede)
[09:17:54] (CR) Muehlenhoff: [C:+2] puppetdb: Move JVM config out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:19:22] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on cloudcephosd1025.eqiad.wmnet with reason: Getting the disks shipped to dell T348643
[09:19:28] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[09:19:36] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on cloudcephosd1025.eqiad.wmnet with reason: Getting the disks shipped to dell T348643
[09:19:52] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10174599 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=087b480f-3f34-4877-a07a-3baa2b98f863) s...
[09:20:14] SRE-tools, Infrastructure-Foundations: SREBatchBase: Print allowed aliases in help message - https://phabricator.wikimedia.org/T375590#10174602 (Volans) With the current API that's not possible because `allowed_aliases` is an instance property (not a class property) of the runner class, not the cookbook...
[09:22:54] !log Upgrade idm2001 to Bitu version 0.0.9
[09:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1038.eqiad.wmnet
[09:26:21] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:30:12] (PS1) Muehlenhoff: Switch cloudcephosd1038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075514 (https://phabricator.wikimedia.org/T349619)
[09:30:46] (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075514 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:34:39] (Abandoned) Muehlenhoff: Stop including profile::configmaster in puppetmaster frontends [puppet] - https://gerrit.wikimedia.org/r/1007363 (https://phabricator.wikimedia.org/T341717) (owner: Muehlenhoff)
[09:35:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1038.eqiad.wmnet
[09:36:31] (PS1) Filippo Giunchedi: vopsbot: remove systemd::service alert, replaced by alertmanager [puppet] - https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808)
[09:36:32] (PS1) Filippo Giunchedi: icinga: replace url checks with pingthing [puppet] - https://gerrit.wikimedia.org/r/1075516 (https://phabricator.wikimedia.org/T321808)
[09:36:49] (CR) CI reject: [V:-1] vopsbot: remove systemd::service alert, replaced by alertmanager [puppet] - https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[09:39:33] !log installing distro-info-data updates from bullseye/bookworm point updates
[09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:04] (PS2) Filippo Giunchedi: vopsbot: remove systemd::service alert, replaced by alertmanager [puppet] - https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808)
[09:40:05] (PS2) Filippo Giunchedi: icinga: replace url checks with pingthing [puppet] - https://gerrit.wikimedia.org/r/1075516 (https://phabricator.wikimedia.org/T321808)
[09:43:18] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10174729 (MoritzMuehlenhoff)
[09:45:32] SRE, Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10174735 (MoritzMuehlenhoff)
[09:48:39] (PS1) Mvolz: Revert^2 "Update Zotero to node 18" [deployment-charts] - https://gerrit.wikimedia.org/r/1075517
[09:49:43] (PS2) Mvolz: Revert^2 "Update Zotero to node 18" [deployment-charts] - https://gerrit.wikimedia.org/r/1075517 (https://phabricator.wikimedia.org/T361728)
[09:49:56] RESOLVED: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:50:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:54:19] (PS3) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389)
[09:54:56] RESOLVED: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:25] (CR) CI reject: [V:-1] Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1000)
[10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1000).
[10:01:48] akosiaris: o/
[10:04:08] (CR) Hnowlan: [C:+1] "I'm also a little short on the history but it seems like this is very safe to do." [puppet] - https://gerrit.wikimedia.org/r/1075152 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[10:04:15] (PS4) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389)
[10:06:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1025.eqiad.wmnet
[10:07:45] (PS1) Muehlenhoff: Switch cloudcephosd1025 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075519 (https://phabricator.wikimedia.org/T349619)
[10:08:09] akosiaris: you around?
[10:09:04] (03CR) 10Tiziano Fogli: [C:03+2] ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:10:45] (03Merged) 10jenkins-bot: ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:11:46] mvolz: yes [10:12:04] sorry, I was drafting an email. [10:12:09] npn [10:12:12] np* [10:13:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:04] so I'm thinking we +2 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075517, we deploy staging, then we either depool codfw or eqiad and then deploy there? [10:14:15] eqiad is depooled right now anyway [10:14:16] I assume the least active one should be depooled? [10:14:19] oh it is? [10:14:24] this is the switchover week [10:14:33] it was done yesterday ~at 16:00UTC [10:14:35] oooohhhh [10:14:57] so actually this is a good week for it then. [10:15:00] you should be seeing 0 traffic in graphs btw (health checks aside) [10:15:15] (03CR) 10Alexandros Kosiaris: [C:03+1] Revert^2 "Update Zotero to node 18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075517 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [10:15:46] will we see alerts though? [10:15:55] yes [10:16:16] i can see those still look up for eqiad. okay... should we disable them or something to avoid pinging on call? [10:16:39] (I'll do staging now) [10:16:42] I can do that, gimme a sec [10:16:49] ok [10:17:01] both zotero AND citoid, right? [10:17:06] only zotero [10:17:10] ok [10:17:11] citoid we're not touching [10:17:23] although do the alerts give useful info? [10:17:27] probably not.
[10:17:31] it might alert because of zotero which is a dependency? [10:17:45] in any case, I 'll disable the paging ones, we 'll see the non paging ones [10:17:57] does eqiad citoid contact eqiad zotero? [10:18:07] wouldn't it just use whichever one is pooled? or not? [10:18:14] ah, right now, no it doesn't. you are right [10:19:03] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [10:19:23] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:21:41] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1025 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075519 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:22:08] I see the pod is up. no logs ofc for zotero, the standard envoy startup logs for the tls-proxy sidecar container [10:22:19] what's the test curl call again? [10:22:20] (03PS1) 10Gmodena: mw-page-content-change-enrich: enable claico network policies. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) [10:22:33] https://wikitech.wikimedia.org/wiki/Zotero/Deploying_zotero#Staging_server [10:22:38] I tried both of the samples there [10:22:42] they responded just fine. [10:23:42] well, off to eqiad then ? 
[10:23:55] sure, will do [10:24:10] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [10:24:35] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [10:25:05] oh no it's working [10:25:13] lol [10:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1025.eqiad.wmnet [10:25:57] curl -k -d 'https://en.wikipedia.org/wiki/Darth_Vader' -H 'Content-Type: text/plain' https://zotero.svc.eqiad.wmnet:4969/web I think is even the probe query that goes down [10:25:59] (03CR) 10Btullis: Specify a custom deploy clusterrole for airflow namespaces in dse (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [10:28:24] (03CR) 10Gmodena: mw-page-content-change-enrich: enable claico network policies. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) (owner: 10Gmodena) [10:28:47] mvolz: can't say I can reproduce. It's consistently returning ok? [10:29:34] takes a while but on every call I see [{"key":"TYBXVFQ6","version":0 yada yada yada [10:29:42] Yeah works fine for me too [10:31:31] James_F recently posted something about how there are errors for the tls proxies or something? [10:32:01] alerts about them being at times close to memory limits [10:32:20] (03PS1) 10Abijeet Patro: Revert^2 "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075522 [10:32:52] which we can bump ofc, but there should be something more (like a log line) that points out that's the issue [10:33:06] wanna go for codfw? 
[10:33:22] at least at that point we 'll know [10:33:39] sure but I will bet you it alerts :P [10:33:59] I can also repool eqiad for a while [10:34:14] if we think it's traffic related, that should point to something [10:34:46] so it's going to Alert citoid, because citoid is the proxy for zotero [10:34:56] basically swagger checks whether the response came from zotero [10:35:13] (03PS2) 10Abijeet Patro: Translate: Add VirtualDomainsMapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) [10:35:21] since citoid (ideally) has a native scraping option, then it'll 200 but it won't be from zotero, and then will alert. [10:36:06] (03CR) 10Daniel Kinzler: [C:03+1] "yes, please" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [10:36:39] mvolz: let's see, it's easy enough to depool eqiad anyway [10:36:53] I 'll only pool zotero, not both [10:37:17] ok, let me know when you're done repooling [10:38:44] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route pool zotero in eqiad: maintenance [10:41:58] (03CR) 10Ladsgroup: [C:03+1] Translate: Add VirtualDomainsMapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [10:42:54] (03CR) 10Sg912: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) (owner: 10Santiago Faci) [10:43:10] I can see some 500s [10:43:26] akosiaris: where are you looking? 
[10:43:39] (03CR) 10Santiago Faci: [C:03+2] MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) (owner: 10Santiago Faci) [10:43:39] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-kubernetes_namespace=zotero&var-kubernetes_namespace=citoid&var-app=All&var-destination=zotero&from=now-15m&to=now [10:43:45] nothing out of the ordinary though [10:43:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool zotero in eqiad: maintenance [10:44:35] it's done btw [10:44:38] (03Merged) 10jenkins-bot: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) (owner: 10Santiago Faci) [10:44:40] yeah it looks normal [10:44:47] 500s are kind of normal for Zotero [10:44:59] it reports a lot of things that are more like 4xx as 500 [10:45:51] well, deploy to codfw I 'd say? by now I don't know much more that we can do to gain more confidence [10:46:16] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [10:46:37] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [10:47:07] I'm a little confused, in grafana the number of requests don't seem to show any decrease from the depooling? [10:47:26] which graph are you looking at? 
[10:47:37] the linked envoy telemetry one [10:48:08] oh nevermind that's because it's codfw [10:48:10] haha [10:48:42] ah wait, there is 1 more thing I can do [10:48:47] I 'll depool zotero in codfw [10:48:49] gimme a sec [10:48:58] in which case I don't see an increase in requests after re-pooling [10:49:06] (03PS1) 10Samtar: IS-labs: Enable wgUseCodexSpecialBlock on test.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075523 (https://phabricator.wikimedia.org/T375610) [10:49:09] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route depool zotero in codfw: maintenance [10:49:10] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache zotero.discovery.wmnet on all recursors [10:49:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zotero.discovery.wmnet on all recursors [10:50:34] (03PS1) 10Santiago Faci: MPIC: Deploying to production a new relase v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) [10:52:29] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-kubernetes_namespace=zotero&var-app=All&var-destination=All&from=now-30m&to=now [10:52:34] (03CR) 10Sg912: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) (owner: 10Santiago Faci) [10:52:38] traffic to eqiad zotero is definitely increasing [10:52:54] it's still 0.5rps, but it is what it is [10:53:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070562 (https://phabricator.wikimedia.org/T369069) (owner: 10Sergio Gimeno) [10:53:24] max apparently in the last 2 days has been ~4 [10:53:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host 
cloudcephosd1037.eqiad.wmnet [10:54:08] i don't see it in envoy? [10:54:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool zotero in codfw: maintenance [10:54:15] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-kubernetes_namespace=zotero&var-app=zotero&var-destination=local_service&from=now-30m&to=now [10:54:21] this ^ is codfw dropping [10:54:32] switch to eqiad on the dropdown and it should be increasing [10:54:43] (03PS1) 10Muehlenhoff: Switch cloudcephosd1037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075526 (https://phabricator.wikimedia.org/T349619) [10:54:47] ok [10:54:59] this dashboard could be done a bit better, but I 'll file that for later [10:55:26] I had it set to zotero and not local service [10:55:30] not sure what the difference is :P [10:55:33] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075526 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:56:30] still looking okay to me, you? [10:56:33] (03CR) 10Santiago Faci: [C:03+2] MPIC: Deploying to production a new relase v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) (owner: 10Santiago Faci) [10:56:36] "zotero" is as citoid side sees it. 
local_service is as the local envoy sees it [10:56:46] (03CR) 10Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [10:56:53] (03CR) 10Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [10:56:59] but since all services have a "local_service" you need to take a bit of care to only pick the proper service [10:57:33] (03Merged) 10jenkins-bot: MPIC: Deploying to production a new relase v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) (owner: 10Santiago Faci) [10:57:40] so, yeah depooling of codfw and full repooling of eqiad has happened and I see 0 worrying things up to now [10:57:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [10:57:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ...
[10:58:02] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:58:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075522 (owner: 10Abijeet Patro) [10:58:20] (03PS5) 10Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) [10:59:13] (03PS6) 10Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) [10:59:21] mvolz: you promised me alerts and I see none :P [10:59:32] i'm sorry :( [10:59:49] I would ask what it is that you witnessed last time, but with zotero not emitting either metrics or logs that would be pointless [10:59:50] the only thing that changed since then is envoy has been updated. 
[11:00:08] the last time the swagger probe started failing [11:00:11] jouncebot: now [11:00:11] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [11:00:16] jouncebot: next [11:00:17] In 1 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1300) [11:00:46] https://phabricator.wikimedia.org/T361728 [11:00:58] (03Abandoned) 10Brouberol: modules: release a new version of app.job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954) (owner: 10Brouberol) [11:01:16] yeah all we know is zotero just wasn't responding [11:01:19] or was giving errors [11:01:23] one of those two [11:01:33] and citoid was like nvm i'll do it myself [11:02:49] Actually jon said something about tls terminator? [11:02:56] RESOLVED: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [11:03:01] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:03:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1037.eqiad.wmnet [11:03:58] mvolz: so, zotero codfw right now is depooled. Wanna try upgrading there? 
[11:04:06] sure [11:04:21] let's run some tests there too and then I 'll pool it, then wait it out and then depool eqiad again [11:04:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1036.eqiad.wmnet [11:04:50] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [11:05:19] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:06:19] (03PS1) 10Muehlenhoff: Switch cloudcephosd1036 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075527 (https://phabricator.wikimedia.org/T349619) [11:07:20] Looks okay again. [11:08:07] (03PS1) 10Elukey: profile::trafficserver::backend: change timeouts for the docker registry [puppet] - 10https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) [11:08:21] ok, repooling codfw then [11:08:32] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route pool zotero in codfw: maintenance [11:08:33] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache zotero.discovery.wmnet on all recursors [11:08:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zotero.discovery.wmnet on all recursors [11:08:53] which alerts were turned off? was it just the eqiad ones? [11:09:03] (03PS1) 10Gmodena: mw-page-content-change-enrich: disable legary network policies.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) [11:09:25] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1036 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075527 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:09:31] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4118/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) (owner: 10Elukey) [11:09:54] (03PS2) 10Gmodena: mw-page-content-change-enrich: disable legacy network policies. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) [11:10:35] !log running UPDATE into viwiki db2218 (s7) T375507 [11:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [11:10:56] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:11:19] (03CR) 10Elukey: [V:03+1] "Hi folks! I have no idea if this is the preferred/best way forward on the ATS side, lemme know if you feel differently." [puppet] - 10https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) (owner: 10Elukey) [11:11:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudcephosd1036.eqiad.wmnet [11:11:52] mvolz: none right now, they 've expired [11:11:59] ok [11:13:12] (03CR) 10Gmodena: [C:03+2] dse-k8s-services: fix values in dump enrichment app. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [11:13:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool zotero in codfw: maintenance [11:13:52] this is ridiculous https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/1075530 [11:14:21] (03Merged) 10jenkins-bot: dse-k8s-services: fix values in dump enrichment app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [11:15:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:15:09] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:15:11] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:15:17] ok, depooling eqiad once more [11:15:24] zotero@eqiad* [11:15:34] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route depool zotero in eqiad: maintenance [11:15:36] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache zotero.discovery.wmnet on all recursors [11:15:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zotero.discovery.wmnet on all recursors [11:16:22] akosiaris: why? [11:16:58] it's switchover week. 
The intent is to see for a week (until next Wednesday) whether we 'd survive on a single DC [11:17:04] oh, ok [11:17:07] sorry, forgot haha [11:17:09] and also perform maintenance on various things in eqiad [11:17:14] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: test defs_from_etcd on the replica [puppet] - 10https://gerrit.wikimedia.org/r/1075504 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [11:17:15] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:17:18] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:17:33] for a moment i had thought it was misbehaving somehow and i'd missed it [11:20:11] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:20:37] mvolz: I am gonna call it done and go have lunch. 
Ping me if anything goes wrong, but I 'll probably see pages quick enough [11:20:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool zotero in eqiad: maintenance [11:21:25] akosiaris: ok great, thanks, sorry for no alerts ;) [11:22:05] (03CR) 10Jelto: [V:03+1 C:03+2] "this fails because there is a IPv6 address in the IPv4 requestctl.nft set:" [puppet] - 10https://gerrit.wikimedia.org/r/1075504 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [11:22:30] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:22:33] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:22:57] :D [11:23:36] (03PS2) 10Abijeet Patro: Revert^2 "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075522 (https://phabricator.wikimedia.org/T372386) [11:23:41] (03PS1) 10Jelto: Revert "gitlab: test defs_from_etcd on the replica" [puppet] - 10https://gerrit.wikimedia.org/r/1075533 (https://phabricator.wikimedia.org/T366882) [11:30:11] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:30:24] !log running DELETE + REPLACE on kowiki db2218 (s7) T375186 [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:30] T375186: databases preswitchover checks - https://phabricator.wikimedia.org/T375186 [11:30:45] (03PS1) 10Effie Mouzeli: mw-mcrouter: double number of threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075534 [11:31:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing 
latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:33:49] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: double number of threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075534 (owner: 10Effie Mouzeli) [11:34:18] (03CR) 10Jelto: [C:03+2] Revert "gitlab: test defs_from_etcd on the replica" [puppet] - 10https://gerrit.wikimedia.org/r/1075533 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [11:34:47] (03Merged) 10jenkins-bot: mw-mcrouter: double number of threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075534 (owner: 10Effie Mouzeli) [11:35:11] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:35:47] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [11:35:54] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [11:36:00] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [11:36:33] !log updating mw-mcrouter [11:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:44] (03PS1) 10Brouberol: flink-operator: specify a list of NS to watch in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075537
(https://phabricator.wikimedia.org/T368787) [11:37:45] (03CR) 10Gmodena: [C:03+1] "LGTM. Reviewed over a pair programming session." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075537 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [11:38:57] 10SRE-tools, 06Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342#10175144 (10ayounsi) 05Open→03Resolved a:03ayounsi [11:40:00] !log running DELETE + REPLACE on metawiki db2218 (s7) T375186 [11:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:07] T375186: databases preswitchover checks - https://phabricator.wikimedia.org/T375186 [11:41:40] (03CR) 10Tacsipacsi: Revert^2 "Enable message group subscription feature for Test Wikipedia" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075522 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:41:45] (03CR) 10Brouberol: [C:03+2] flink-operator: specify a list of NS to watch in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075537 (https://phabricator.wikimedia.org/T368787) (owner: 10Brouberol) [11:42:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:43:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:46:42] (03CR) 10Ayounsi: [C:03+2] gNMI prometheus check: add specific network CA cert [puppet] - 10https://gerrit.wikimedia.org/r/1075235 (https://phabricator.wikimedia.org/T375513) (owner: 10Ayounsi) [11:48:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:50:05] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:50:06] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [11:50:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:50:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:50:25] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:50:29] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:51:27] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:51:29] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [11:51:52] !log running REPLACE into cawiki db2218 (s7) T375507 [11:51:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:55:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:55:22] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:55:27] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:58:54] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [11:58:58] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [12:00:55] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:00:59] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:01:45] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:04:08] (03PS7) 10Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) 
[12:05:11] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:12:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:12:55] (03CR) 10Brouberol: [C:03+1] "Actually, we need to enable ingress from DSE_KUBEPODS_NETWORKS to the 8793 port (cf https://github.com/apache/airflow/blob/b9b7bfc6a8dff03" [puppet] - 10https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [12:13:00] (03CR) 10Brouberol: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - 10https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [12:14:54] (03CR) 10Urbanecm: [C:03+1] "We were still A/B testing on beta? Ouch... thanks for noticing!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) (owner: 10Cyndywikime) [12:15:09] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:15:11] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:15:57] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:16:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:20:11] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:21:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:21:37] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075245 (https://phabricator.wikimedia.org/T374387) (owner: 10Kevin Bazira) [12:22:22] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [12:22:36] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [12:22:41] (03Merged) 
10jenkins-bot: ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075245 (https://phabricator.wikimedia.org/T374387) (owner: 10Kevin Bazira) [12:24:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1035.eqiad.wmnet [12:24:08] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10175282 (10dcaro) @wiki_willy Okok, the node is ready, I just shut it down and created a downtime for 180 days, you... [12:26:06] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:27:58] (03PS1) 10Muehlenhoff: Switch cloudcephosd1035 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075547 (https://phabricator.wikimedia.org/T349619) [12:28:09] (03PS1) 10Ayounsi: Prometheus gNMI check: use TCP check instead [puppet] - 10https://gerrit.wikimedia.org/r/1075548 (https://phabricator.wikimedia.org/T369384) [12:28:54] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075548 (https://phabricator.wikimedia.org/T369384) (owner: 10Ayounsi) [12:29:06] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:29:44] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:29:48] (03PS1) 10Slyngshede: Minor UI improvements. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1075550 (https://phabricator.wikimedia.org/T375168) [12:30:14] (03CR) 10Filippo Giunchedi: [C:03+1] Prometheus gNMI check: use TCP check instead [puppet] - 10https://gerrit.wikimedia.org/r/1075548 (https://phabricator.wikimedia.org/T369384) (owner: 10Ayounsi) [12:30:21] (03CR) 10Ayounsi: [C:03+2] Prometheus gNMI check: use TCP check instead [puppet] - 10https://gerrit.wikimedia.org/r/1075548 (https://phabricator.wikimedia.org/T369384) (owner: 10Ayounsi) [12:31:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:31:31] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:31:55] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1035 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075547 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:35:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1035.eqiad.wmnet [12:37:10] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:37:13] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:38:24] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:38:27] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:40:11] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:42:58] (03PS1) 10David Caro: cloudcephosd: don't remove_os_md [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) [12:52:15] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-08-13-135124 to 2024-09-24-145528 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075554 (https://phabricator.wikimedia.org/T368654) [12:52:48] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-08-20-132618 to 2024-09-24-221243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075555 (https://phabricator.wikimedia.org/T363714) [12:53:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1300). [13:00:05] sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:29] hi [13:00:32] (03PS1) 10Jelto: profile::firewall: separate ipv4 and ipv6 in nftables BLOCKED_NETS [puppet] - 10https://gerrit.wikimedia.org/r/1075556 (https://phabricator.wikimedia.org/T348734) [13:02:54] (03PS1) 10Slyngshede: C:puppetmaster::scripts update public key for puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075557 [13:04:38] (03PS2) 10Slyngshede: C:puppetmaster::scripts update public key for puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075557 [13:05:47] I'm going to self-deploy [13:05:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070562 (https://phabricator.wikimedia.org/T369069) (owner: 10Sergio Gimeno) [13:06:36] (03Merged) 10jenkins-bot: Add wgCommunityConfigurationCommonsApiURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070562 (https://phabricator.wikimedia.org/T369069) (owner: 10Sergio Gimeno) [13:06:58] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1070562|Add wgCommunityConfigurationCommonsApiURL (T369069)]] [13:07:04] T369069: Create a dedicated component to select files from Commons - https://phabricator.wikimedia.org/T369069 [13:08:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:08:46] (03PS1) 10David Caro: blackbox_exporter: allow disabling gnmi and disable in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1075559 [13:09:06] !log sgimeno@deploy1003 sgimeno: Backport for [[gerrit:1070562|Add wgCommunityConfigurationCommonsApiURL (T369069)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:50] (03PS1) 10DCausse: rdf-streaming-updater: revert to using plaintext with kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075561 (https://phabricator.wikimedia.org/T333373) [13:09:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:10:19] !log sgimeno@deploy1003 sgimeno: Continuing with sync [13:10:42] continue with sync since its noop change [13:10:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:11:26] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:13:47] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [13:13:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:14:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:14:59] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:15:04] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070562|Add 
wgCommunityConfigurationCommonsApiURL (T369069)]] (duration: 08m 06s) [13:15:11] T369069: Create a dedicated component to select files from Commons - https://phabricator.wikimedia.org/T369069 [13:15:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Make sure to force a Puppet run on A:installserver-full after merging before you kick off the re-reimage of 1039." [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:16:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1075557 (owner: 10Slyngshede) [13:17:44] (03CR) 10Slyngshede: [C:03+2] C:puppetmaster::scripts update public key for puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075557 (owner: 10Slyngshede) [13:17:49] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: revert to using plaintext with kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075561 (https://phabricator.wikimedia.org/T333373) (owner: 10DCausse) [13:17:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:18:19] (03PS2) 10David Caro: blackbox_exporter: allow disabling gnmi and disable in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1075559 [13:18:49] (03Merged) 10jenkins-bot: rdf-streaming-updater: revert to using plaintext with kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075561 (https://phabricator.wikimedia.org/T333373) (owner: 10DCausse) [13:18:59] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:19:18] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:19:28] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [13:19:33] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:19:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:20:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.35s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:20:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:21:03] (03PS3) 10David Caro: blackbox_exporter: allow disabling gnmi and disable in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1075559 [13:21:17] <_joe_> uhm kartotherian [13:21:26] <_joe_> !incidents [13:21:27] 5278 (UNACKED) [2x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [13:21:27] 5277 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:21:27] 5276 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:21:31] (03CR) 10FNegri: cloudcephosd: don't remove_os_md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:21:57] <_joe_> !ack 5278 [13:21:57] 5278 (ACKED) [2x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [13:22:02] I'm surprised this even pages, I thought we had disabled that in the 
past [13:22:07] <_joe_> yes [13:22:21] <_joe_> the issue is that we didn't exclude it from the ats backend check [13:22:38] <_joe_> moritzm: I'm looking into it [13:22:42] kartotherian is struggling since midnight BTW [13:22:45] ack [13:22:47] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [13:23:04] https://grafana.wikimedia.org/goto/3e7XpygNR?orgId=1 [13:23:09] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [13:23:18] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [13:24:33] (03CR) 10David Caro: cloudcephosd: don't remove_os_md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:24:40] (03CR) 10Btullis: [C:03+1] "Lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [13:25:11] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:25:14] (03CR) 10Brouberol: [C:03+2] Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [13:25:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.35s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:25:43] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10175439 (10Papaul) [13:25:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:25:52] (03CR) 10David Caro: cloudcephosd: don't remove_os_md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:25:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2009.codfw.wmnet, maps2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:26:22] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [13:26:32] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [13:26:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:26:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:26:59] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2005.codfw.wmnet, maps2007.codfw.wmnet, maps2008.codfw.wmnet, maps2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:27:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:28:09] <_joe_> !log restart karthoterian on maps2005 [13:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:50] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10175445 (10Papaul) [13:28:59] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:29:07] <_joe_> uhm ok [13:29:16] <_joe_> looks like the farmer's method might solve things [13:29:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:29:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:29:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:29:58] (03CR) 10Muehlenhoff: Minor UI improvements. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1075550 (https://phabricator.wikimedia.org/T375168) (owner: 10Slyngshede) [13:30:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:31:01] (03PS1) 10Sbisson: CX3 Build 0.2.0+20240925 [extensions/ContentTranslation] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075567 (https://phabricator.wikimedia.org/T374387) [13:31:11] <_joe_> !log rolling restart of kartotherian in codfw [13:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:31:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:33:55] (03PS2) 10David Caro: cloudcephosd: don't remove_os_md [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) [13:34:04] (03PS3) 10Scott French: wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) [13:34:04] (03PS3) 10Scott French: wmnet: update CNAME record for maintenance host to codfw [dns] - 10https://gerrit.wikimedia.org/r/1073898 (https://phabricator.wikimedia.org/T370962) [13:34:04] (03PS3) 10Scott French: geo-maps: update map default to list codfw first [dns] - 10https://gerrit.wikimedia.org/r/1073899 (https://phabricator.wikimedia.org/T370962) [13:34:05] (03PS3) 10Scott French: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1073900 (https://phabricator.wikimedia.org/T370962) [13:34:08] (03PS2) 10Slyngshede: Minor UI improvements. [software/bitu] - 10https://gerrit.wikimedia.org/r/1075550 (https://phabricator.wikimedia.org/T375168) [13:34:16] (03CR) 10Slyngshede: Minor UI improvements. 
(032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1075550 (https://phabricator.wikimedia.org/T375168) (owner: 10Slyngshede) [13:34:24] (03CR) 10David Caro: cloudcephosd: don't remove_os_md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:34:27] !incidents [13:34:28] 5278 (ACKED) [2x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [13:34:28] 5277 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:34:28] 5276 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:34:50] (03CR) 10David Caro: cloudcephosd: don't remove_os_md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:35:11] (03PS3) 10David Caro: cloudcephosd: don't remove_os_md [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) [13:35:29] (03CR) 10Muehlenhoff: cloudcephosd: don't remove_os_md (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:37:31] I'm backporting 1075567 as window is still on. [13:39:42] kart_ o/ [13:39:47] yo [13:39:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075567 (https://phabricator.wikimedia.org/T374387) (owner: 10Sbisson) [13:40:03] (03CR) 10Scott French: "Thanks, all, for the reviews!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: 10Scott French) [13:40:05] (03CR) 10Scott French: [C:03+2] sre.discovery.datacenter: exclude kibana7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: 10Scott French) [13:40:12] stephanebisson: some wait for CI now. I'll ping for testing after that. [13:40:36] ETA 23 minutes! [13:40:53] wow that's slow [13:41:18] (03CR) 10Herron: [C:03+1] zuul: send stats to prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [13:41:57] Wait. It is 30 minutes. I'll finish dinner ;) [13:42:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:42:58] (03CR) 10Ayounsi: [C:03+1] "lgtm! Adding Filippo to the review as well as he knows more about Prometheus than me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [13:45:06] (03CR) 10David Caro: cloudcephosd: don't remove_os_md (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075552 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [13:45:50] (03CR) 10David Caro: [V:03+1] blackbox_exporter: allow disabling gnmi and disable in cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [13:46:17] (03Abandoned) 10Andrea Denisse: smart: Refine data collection to differentiate RAID and non-RAID disks [puppet] - 10https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: 10Andrea Denisse) [13:46:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:47:00] <_joe_> !log repooling karthoterian in eqiad, a further roll restart in codfw [13:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:15] (03PS4) 10David Caro: blackbox_exporter: allow disabling gnmi and disable in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1075559 [13:47:23] (03CR) 10Filippo Giunchedi: "This works though having 'ops' prometheus instance shipping prometheus::blackbox::module { ...gnmi.. } and the cert file would be cleaner " [puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [13:48:03] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1075516 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:49:38] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:50:08] !log kartotherian repooled in eqiad due to load issues - T370962 [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:15] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [13:50:43] (03CR) 10Herron: [C:03+1] "nice! happy to see pingthing come in handy" [puppet] - 10https://gerrit.wikimedia.org/r/1075516 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:51:24] (03CR) 10Alexandros Kosiaris: [C:03+1] "I think we can proceed with this revert. I'll deploy tomorrow, unless there are any objections." [puppet] - 10https://gerrit.wikimedia.org/r/1072754 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:51:26] RESOLVED: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [13:51:32] Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:52:36] (03CR) 10Muehlenhoff: "Sounds good, I also had planned to merge this week, but it didn't bubble to the top of the queue yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072754 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:53:50] (03CR) 10Herron: [C:03+1] uwsgi: remove icinga-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) (owner: 10Filippo Giunchedi) [13:54:42] (03CR) 10Herron: [C:03+1] vopsbot: remove systemd::service alert, replaced by alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [13:55:18] (03Merged) 10jenkins-bot: sre.discovery.datacenter: exclude kibana7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: 10Scott French) [13:55:24] (03PS1) 10Elukey: docker_registry_ha: increase proxy timeouts to 300 [puppet] - 10https://gerrit.wikimedia.org/r/1075570 (https://phabricator.wikimedia.org/T242604) [13:55:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:56:38] (03CR) 10David Caro: "That would mean needing to setup a secret with the cert for tools/metricsinfra/(any other project that has prometheus)?" [puppet] - 10https://gerrit.wikimedia.org/r/1075559 (owner: 10David Caro) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1400) [14:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do Southward Datacenter Switchover: MediaWiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1400). 
[14:00:25] here o/ [14:01:23] (CR) Filippo Giunchedi: [C:+1] "I mean the prometheus 'ops' instance ships sth like" [puppet] - https://gerrit.wikimedia.org/r/1075559 (owner: David Caro) [14:01:23] deployers, FYI: we're going to start preparation for the switchover in ~ 20 minutes or so, at which time we'll be taking the scap lock. [14:01:34] The backport of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1075567 is still waiting to be merged BTW [14:01:59] kart_: For the above. [14:02:44] swfrench-wmf: I've deployment in progress. Sorry for the delay. CI still running :/ [14:03:21] zuul status says 10 mins to merge for that backport [14:03:24] SRE, Data-Engineering, Observability-Logging, Wikimedia-Logstash, Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645#10175713 (Ottomata) This would also help with some analysis in {T375146} [14:03:57] (CR) David Caro: [C:+2] "Aaaah, now I understand :)" [puppet] - https://gerrit.wikimedia.org/r/1075559 (owner: David Caro) [14:03:58] kart_: Dreamy_Jazz: ack, thanks for the heads up. [14:04:12] just ping me here when you're done :) [14:04:23] Sure swfrench-wmf! 
[14:08:30] (Merged) jenkins-bot: CX3 Build 0.2.0+20240925 [extensions/ContentTranslation] (wmf/1.43.0-wmf.24) - https://gerrit.wikimedia.org/r/1075567 (https://phabricator.wikimedia.org/T374387) (owner: Sbisson) [14:08:52] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075567|CX3 Build 0.2.0+20240925 (T374387 T370746 T368422 T374567 T355780 T374559 T374886 T375410)]] [14:09:11] (CR) Jcrespo: [C:+1] pc2017: Set it to master [puppet] - https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355) (owner: Ladsgroup) [14:09:15] T374387: Call to section recommendation API mysteriously failing - https://phabricator.wikimedia.org/T374387 [14:09:16] T370746: CX Unified Dashboard: Support suggestions based on previous edits - https://phabricator.wikimedia.org/T370746 [14:09:17] T368422: Custom translation suggestions: Basic topic selection - https://phabricator.wikimedia.org/T368422 [14:09:17] T374567: SX: Set aria-label to icon-only Codex buttons - https://phabricator.wikimedia.org/T374567 [14:09:18] T355780: SX: Refactor SFCs to use