[00:05:04] ops-eqiad, SRE, DC-Ops: sessionstore1005.eqiad.wmnet is down - https://phabricator.wikimedia.org/T421297#11752203 (Jclark-ctr) Open→Resolved Repeat of backplane communication event. Opened case with Dell Technologies. Dell reviewed logs and found no prior cases on record and no confirme...
[00:17:13] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11752206 (KFrancis) Hi all, the NDA is complete. Thanks!
[00:17:34] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11752207 (KFrancis) Hi all, the NDA is complete. Thanks!
[00:17:57] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11752208 (KFrancis) Hi all, the NDA is complete. Thanks!
[00:18:20] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11752209 (KFrancis) Hi all, the NDA is complete. Thanks!
[00:33:31] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11752234 (Scott_French) @KFrancis - Great, thank you! @katiamusiolekwmde - Please see T420459#11733750. I still don't see a Developer (LDAP) account associated with th...
[00:33:38] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11752235 (Scott_French)
[00:42:59] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1260865
[00:42:59] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1260865 (owner: TrainBranchBot)
[00:55:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1260865 (owner: TrainBranchBot)
[01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1260186 (owner: TrainBranchBot)
[01:14:26] (CR) Bartosz Dziewoński: [C:+1] "Aside from the bug fix, I like splitting the logic for two quite different things into two files." [deployment-charts] - https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[01:45:38] PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:45:38] PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:09:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:33] (PS1) Scott French: admin: Add dariawmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1260920 (https://phabricator.wikimedia.org/T420716)
[02:34:35] (PS1) Scott French: admin: Add kerenramirezwmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1260921 (https://phabricator.wikimedia.org/T420896)
[02:34:38] (PS1) Scott French: admin: Add alicem to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1260922 (https://phabricator.wikimedia.org/T420751)
[02:35:55] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11752333 (Scott_French)
[02:36:11] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11752334 (Scott_French) Great, thank you! I should be able to wrap this up tomorrow.
[02:36:58] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11752335 (Scott_French)
[02:37:08] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11752336 (Scott_French)
[02:39:22] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11752337 (Scott_French)
[03:15:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:20:38] RECOVERY - MariaDB Replica SQL: s3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:20:38] RECOVERY - MariaDB Replica IO: s3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:24:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1022.eqiad.wmnet with reason: Downgrade clouddb1022 to 10.11.13
[05:30:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:37:38] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T0600)
[06:00:05] marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T0600).
[06:09:40] (CR) Marostegui: [C:+1] P:mariadb::ferm: Fix typing around port variable [puppet] - https://gerrit.wikimedia.org/r/1260717 (owner: Majavah)
[06:10:14] (CR) Marostegui: "Probably worth waiting until we are a bit more stable on https://phabricator.wikimedia.org/T420177 ?" [puppet] - https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: FNegri)
[06:12:36] (CR) Marostegui: [C:+1] "Will you deploy this yourself?" [puppet] - https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[06:33:26] (CR) Arnaudb: [C:+2] gerrit: use Envoy on gerrit [puppet] - https://gerrit.wikimedia.org/r/1259945 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[07:00:04] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:15:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:07] (Abandoned) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256446 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[07:27:43] (PS1) Ryan Kemper: Add sre.hadoop.reboot-coordinators cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1261271 (https://phabricator.wikimedia.org/T421285)
[07:33:11] (CR) CI reject: [V:-1] Add sre.hadoop.reboot-coordinators cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1261271 (https://phabricator.wikimedia.org/T421285) (owner: Ryan Kemper)
[07:37:00] (CR) Ayounsi: "I'm not sure I see the relation between that change and the linked task." [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: Cathal Mooney)
[07:39:20] ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11752514 (ayounsi) This looks like a monitoring bug, the interface is properly named on the switch. Other metrics are not being collected properly as well. Most l...
[07:46:56] (PS2) Ryan Kemper: Add sre.hadoop.reboot-coordinators cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1261271 (https://phabricator.wikimedia.org/T421285)
[07:52:54] (CR) Slyngshede: [C:+1] admin: Add dariawmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1260920 (https://phabricator.wikimedia.org/T420716) (owner: Scott French)
[07:53:45] (CR) Slyngshede: [C:+1] admin: Add alicem to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1260922 (https://phabricator.wikimedia.org/T420751) (owner: Scott French)
[07:54:31] (CR) Slyngshede: [C:+1] admin: Add kerenramirezwmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1260921 (https://phabricator.wikimedia.org/T420896) (owner: Scott French)
[08:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T0800)
[08:00:10] \o
[08:00:43] o/
[08:00:49] \o/
[08:01:05] we barely had any specific errors since yesterday
[08:01:13] beside backend service failing and issuing 503 / timeouts etc
[08:01:24] so I am proceeding
[08:02:11] how boring
[08:02:40] (PS1) TrainBranchBot: group2 to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1261324 (https://phabricator.wikimedia.org/T420479)
[08:02:43] (CR) TrainBranchBot: [C:+2] "Initiated by hashar@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1261324 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[08:03:44] (Merged) jenkins-bot: group2 to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1261324 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[08:04:40] pff
[08:05:35] not sure why it is rebuilding the images :b
[08:11:14] (CR) Hashar: [C:-1] jenkins: include docker, add comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[08:11:56] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.21 refs T420479
[08:12:01] T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[08:18:23] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278
[08:27:06] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278
[08:30:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:55] (PS1) Elukey: services: bump up resources for wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067)
[08:32:52] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T421278
[08:35:32] (PS8) Federico Ceratto: Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635
[08:38:10] (PS9) Elukey: Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[08:39:26] (PS2) Ayounsi: Add Nokia POPs BGP policies [homer/public] - https://gerrit.wikimedia.org/r/1260715 (https://phabricator.wikimedia.org/T408892)
[08:41:00] (CR) CI reject: [V:-1] Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[08:41:00] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:41:30] (CR) Elukey: [C:+2] Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[08:42:11] (CR) Elukey: Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[08:43:48] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:44:06] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:44:11] (CR) CI reject: [V:-1] Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[08:46:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:50:13] (CR) Elukey: "The only weird thing that I see is that in CI py3-mypy is used, not the py311 one. I had to remove a comment that was not needed anymore f" [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[08:50:40] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:51:05] (CR) Arnaudb: [C:+2] gerrit: forward Gitiles traffic to gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[08:51:56] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:56:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[09:01:00] !log starting T416708 - disabling circular replication on core dbs
[09:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:11] T416708: MariaDB Post-DC switchover tasks - https://phabricator.wikimedia.org/T416708
[09:01:46] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section x1
[09:02:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section x1
[09:03:06] (CR) Jcrespo: "Yes, I will be doing this myself." [puppet] - https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[09:05:24] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[09:07:59] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section x3
[09:08:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section x3
[09:09:46] (PS10) Elukey: Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[09:10:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:11:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
[09:12:53] (CR) Vgutierrez: [C:+1] C:mtail extend trafficserver_backend_requests_seconds buckets [puppet] - https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) (owner: Slyngshede)
[09:13:57] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[09:15:50] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section es6
[09:16:06] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section es6
[09:16:13] (CR) Elukey: [C:+2] "To keep archives happy - I have set the min version to py311 because tox-v3, the image that we use for CI, is based on bookworm." [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[09:16:59] (CR) Slyngshede: [C:+2] C:mtail extend trafficserver_backend_requests_seconds buckets [puppet] - https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) (owner: Slyngshede)
[09:18:14] (CR) Cathal Mooney: [C:+1] "lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/1260715 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi)
[09:18:42] (Merged) jenkins-bot: Switch linting to ruff [cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[09:18:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:22:17] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s6
[09:22:30] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[09:22:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s6
[09:26:37] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[09:29:29] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s5
[09:29:36] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:29:43] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s5
[09:30:46] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[09:31:07] (CR) Brouberol: [C:+1] dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: Btullis)
[09:31:12] (CR) Brouberol: [C:+1] Update dse-k8s-eqiad to k8s 1.31 [puppet] - https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[09:31:42] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:32:40] (PS2) Brouberol: Update dse-k8s-eqiad to k8s 1.31 [puppet] - https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[09:32:47] (PS2) Jcrespo: mariadb: Update grants for new hosts ms-backup[12]00[34], which replace [12] [puppet] - https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464)
[09:33:46] aokoth@cumin1003 upgrade (PID 2153261) is awaiting input
[09:36:03] (CR) FNegri: [C:-1] "Yes let's put this on hold." [puppet] - https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) (owner: FNegri)
[09:36:22] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s2
[09:36:37] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s2
[09:42:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:42:54] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:43:44] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s3
[09:43:56] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116520 bytes in 1.999 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:44:00] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s3
[09:44:05] (CR) Volans: "I saw this patch pass by and got nerd-sniped, left some comments, feel free to ignore them." [cookbooks] - https://gerrit.wikimedia.org/r/1261271 (https://phabricator.wikimedia.org/T421285) (owner: Ryan Kemper)
[09:45:55] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:46:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:46:38] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T421278
[09:47:33] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:49:09] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:50:25] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:50:48] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s7
[09:51:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s7
[09:51:21] !log Stopping Gerrit on the replica / gerrit1003 to clear web sessions
[09:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:13] !log Starting Gerrit on the replica / gerrit1003
[09:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:53:57] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw
[09:58:03] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s8
[09:58:17] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s8
[09:58:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259132 (https://phabricator.wikimedia.org/T341599) (owner: Sergio Gimeno)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1000)
[10:03:19] (PS1) Filippo Giunchedi: rabbitmq: enable cli tools peers communication [puppet] - https://gerrit.wikimedia.org/r/1261366 (https://phabricator.wikimedia.org/T420923)
[10:05:20] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s4
[10:05:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s4
[10:11:56] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad for section s1
[10:12:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad for section s1
[10:17:54] (PS1) Filippo Giunchedi: openstack: enable rabbit transient quorum queues [puppet] - https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054)
[10:19:07] (CR) Btullis: [C:+2] Temporarily suspend the flink applications running in dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[10:21:17] (Merged) jenkins-bot: Temporarily suspend the flink applications running in dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[10:21:59] (CR) Btullis: [C:+2] Temporarily disable the deployment of mediawiki to dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1260729 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[10:22:13] !log tappof@cumin1003 START - Cookbook sre.o11y.thanos-compact-restart
[10:22:14] !log tappof@cumin1003 END (PASS) - Cookbook sre.o11y.thanos-compact-restart (exit_code=0)
[10:23:36] !log tappof@cumin1003 START - Cookbook sre.o11y.thanos-compact-restart
[10:23:37] !log tappof@cumin1003 END (PASS) - Cookbook sre.o11y.thanos-compact-restart (exit_code=0)
[10:27:29] (PS1) JavierMonton: stream: mediawiki.page_html_content_change [mediawiki-config] - https://gerrit.wikimedia.org/r/1261377 (https://phabricator.wikimedia.org/T421341)
[10:29:21] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:29:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:29:32] (CR) David Caro: [C:+1] "LGTM, we might want to test in codfw, including playing with rabbit nodes being down and such there too" [puppet] - https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) (owner: Filippo Giunchedi)
[10:29:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:29:55] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:31:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[10:31:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[10:31:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:31:37] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:32:08] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[10:32:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[10:32:41] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync
[10:33:08] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync
[10:34:04] (PS2) Majavah: P:mariadb::ferm: Fix typing around port variable [puppet] - https://gerrit.wikimedia.org/r/1260717
[10:34:45] (CR) Filippo Giunchedi: "Indeed testing in codfw first SGTM!" [puppet] - https://gerrit.wikimedia.org/r/1261374 (https://phabricator.wikimedia.org/T421054) (owner: Filippo Giunchedi)
[10:39:52] (PS1) Jcrespo: mediabackup: Add region parameter for rclone [puppet] - https://gerrit.wikimedia.org/r/1261379 (https://phabricator.wikimedia.org/T420506)
[10:41:48] (CR) Mpostoronca: SI: Enable on bnwiki, itwiki, simplewiki, and plwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529) (owner: Dreamy Jazz)
[10:43:28] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 22): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8347/consol" [puppet] - https://gerrit.wikimedia.org/r/1260717 (owner: Majavah)
[10:44:46] (CR) Majavah: [V:+1 C:+2] P:mariadb::ferm: Fix typing around port variable [puppet] - https://gerrit.wikimedia.org/r/1260717 (owner: Majavah)
[10:46:18] (PS2) Jcrespo: mediabackup: Add region parameter for rclone [puppet] - https://gerrit.wikimedia.org/r/1261379 (https://phabricator.wikimedia.org/T420506)
[10:46:23] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1261379 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[10:47:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: Kgraessle)
[10:48:01] (CR) Dreamy Jazz: SI: Enable on bnwiki, itwiki, simplewiki, and plwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529) (owner: Dreamy Jazz)
[10:48:02] (PS2) Dreamy Jazz: SI: Enable on bnwiki, itwiki, simplewiki, and plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529)
[10:50:52] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8348/co" [puppet] - https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[10:51:34] (CR) Majavah: [V:+1 C:+2] nftables::client: Improve src/dst filter handling [puppet] - https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[10:51:58] jouncebot: nowandnext
[10:51:58] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1000)
[10:51:58] In 1 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1200)
[10:52:17] Anyone mind if I use scap?
[10:52:32] (CR) Jcrespo: [C:+2] mediabackup: Add region parameter for rclone [puppet] - https://gerrit.wikimedia.org/r/1261379 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[10:53:16] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529) (owner: Dreamy Jazz)
[10:54:08] (Merged) jenkins-bot: SI: Enable on bnwiki, itwiki, simplewiki, and plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529) (owner: Dreamy Jazz)
[10:54:50] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1260802|SI: Enable on bnwiki, itwiki, simplewiki, and plwiki (T415529)]]
[10:54:55] T415529: Enable SuggestedInvestigations on bnwiki, itwiki, simplewiki, plwiki - https://phabricator.wikimedia.org/T415529
[10:55:01] (CR) Brouberol: [C:+2] Update dse-k8s-eqiad to k8s 1.31 [puppet] - https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[10:56:58] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1260802|SI: Enable on bnwiki, itwiki, simplewiki, and plwiki (T415529)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:58:07] Can confirm it's available on plwiki ;)
[10:59:31] Nice, just finished my checks to
[10:59:34] *too
[10:59:39] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[11:00:57] (PS1) Elukey: services: remove maps1012 from tegola's and kartotherian's configs [deployment-charts] - https://gerrit.wikimedia.org/r/1261382 (https://phabricator.wikimedia.org/T421350)
[11:01:00] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1260720 (owner: Majavah)
[11:01:21] (CR) Filippo Giunchedi: [C:+1] P:openstack: pdns::auth: Convert port to integer [puppet] - https://gerrit.wikimedia.org/r/1260720 (owner: Majavah)
[11:01:44] (PS2) Majavah: P:openstack: pdns::auth: Convert port to integer [puppet] - https://gerrit.wikimedia.org/r/1260720
[11:02:35] (CR) Jgiannelos: [C:+1] services: remove maps1012 from tegola's and kartotherian's configs [deployment-charts] - https://gerrit.wikimedia.org/r/1261382 (https://phabricator.wikimedia.org/T421350) (owner: Elukey)
[11:04:02] (CR) Majavah: [C:+2] P:openstack: pdns::auth: Convert port to integer [puppet] - https://gerrit.wikimedia.org/r/1260720 (owner: Majavah)
[11:04:13] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1260802|SI: Enable on bnwiki, itwiki, simplewiki, and plwiki (T415529)]] (duration: 09m 23s)
[11:04:18] T415529: Enable SuggestedInvestigations on bnwiki, itwiki, simplewiki, plwiki - https://phabricator.wikimedia.org/T415529
[11:07:31] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw
[11:09:19] (03CR) 10Elukey: [C:03+2] services: remove maps1012 from tegola's and kartotherian's configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261382 (https://phabricator.wikimedia.org/T421350) (owner: 10Elukey) [11:13:12] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [11:14:33] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [11:14:54] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:15:17] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:15:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:34] (03PS1) 10Jcrespo: mediabackup: Add resharding script using rclone [puppet] - 10https://gerrit.wikimedia.org/r/1261387 (https://phabricator.wikimedia.org/T420506) [11:19:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [11:19:36] (03CR) 10CI reject: [V:04-1] mediabackup: Add resharding script using rclone [puppet] - 10https://gerrit.wikimedia.org/r/1261387 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [11:22:02] !log elukey@cumin1003 START - Cookbook sre.postgresql.postgres-init [11:22:02] !log elukey@cumin1003 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [11:22:15] !log elukey@cumin1003 START - Cookbook sre.postgresql.postgres-init [11:22:21] (03PS2) 10Jcrespo: mediabackup: Add resharding script using rclone [puppet] - 10https://gerrit.wikimedia.org/r/1261387 (https://phabricator.wikimedia.org/T420506) [11:24:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [11:29:07] (03CR) 10Cathal Mooney: "I 
guess I linked that task as it relates to our anycast strategy and how internal routing works." [homer/public] - 10https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: 10Cathal Mooney) [11:31:01] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:31:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [11:31:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1202: Depool db1202.eqiad.wmnet to then clone it to db1253.eqiad.wmnet - fceratto@cumin1003 [11:31:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [11:32:12] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1202: Depool db1202.eqiad.wmnet to then clone it to db1253.eqiad.wmnet - fceratto@cumin1003 [11:33:07] (03PS2) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:33:14] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [11:34:30] (03PS3) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:37:29] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [11:38:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [11:39:38] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:41:21] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet [11:41:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update 
network and mgmt - jclark@cumin1003" [11:41:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [11:41:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:42:23] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:42:40] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:42:58] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:43:24] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:43:36] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:44:08] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:47:05] (03PS4) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:47:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11753153 (10Jclark-ctr) [11:47:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet [11:49:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11753157 (10Jclark-ctr) @elukey i am having issues with these failing to provision I have 
sent passwords off tag to you [11:51:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1002.eqiad.wmnet [11:51:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1003.eqiad.wmnet [11:51:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1004.eqiad.wmnet [11:51:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1005.eqiad.wmnet [11:51:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1006.eqiad.wmnet [11:52:06] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:52:30] (03CR) 10Hnowlan: [C:03+2] admin: Add dariawmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1260920 (https://phabricator.wikimedia.org/T420716) (owner: 10Scott French) [11:55:09] (03PS2) 10Scott French: admin: Add kerenramirezwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1260921 (https://phabricator.wikimedia.org/T420896) [11:55:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11753174 (10Jclark-ctr) @herron these servers have arrived. Please update Puppet (both site.pp and preseed.yaml). All new servers are UEFI booting, so please ensure the ent... 
[11:55:28] (03PS2) 10CDanis: haproxy: CIDERGRINDER 🍎 to all drmrs 🚀 [puppet] - 10https://gerrit.wikimedia.org/r/1260771 [11:55:30] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260771 (owner: 10CDanis) [11:55:32] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [11:59:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11753179 (10hnowlan) 05Open→03In progress Merged, thank you @Scott_French! [11:59:24] (03CR) 10Hnowlan: [C:03+2] admin: Add kerenramirezwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1260921 (https://phabricator.wikimedia.org/T420896) (owner: 10Scott French) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1200) [12:00:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet [12:00:45] (03PS5) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:00:53] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1260771 (owner: 10CDanis) [12:00:54] (03PS2) 10Scott French: admin: Add alicem to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1260922 (https://phabricator.wikimedia.org/T420751) [12:01:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11753187 (10hnowlan) 05Open→03In progress Merged, thank you Scott! 
[12:01:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet [12:01:59] jouncebot: nowandnext [12:01:59] For the next 0 hour(s) and 58 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1200) [12:01:59] In 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1300) [12:02:12] RECOVERY - Postgres Replication Lag on maps1012 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:02:45] !log elukey@cumin1003 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [12:03:15] (03PS6) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [12:03:15] (03PS6) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [12:03:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet [12:03:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet [12:03:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet [12:04:49] (03CR) 10Hnowlan: [C:03+2] admin: Add alicem to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1260922 (https://phabricator.wikimedia.org/T420751) (owner: 10Scott French) [12:05:25] FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:26] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11753193 (10hnowlan) 05Open→03In progress Merged, thank you Scott! [12:05:48] (03CR) 10Tiziano Fogli: "See also: Ieaf8a85a79e7d4e110c48cfd82f7aed7eec08f98" [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [12:06:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11753196 (10hnowlan) 05Open→03Stalled [12:07:55] (03PS40) 10Tiziano Fogli: sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) [12:08:13] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11753201 (10hnowlan) 05Open→03Stalled [12:10:52] (03CR) 10Btullis: [C:03+2] dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis) [12:12:36] !log 💔cdanis@cumin1003.eqiad.wmnet ~ 🕗☕ sudo cumin 'A:cp-drmrs' 'disable-puppet "cdanis CIDER"' [12:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:45] (03CR) 10CDanis: [C:03+2] haproxy: CIDERGRINDER 🍎 to all drmrs 🚀 [puppet] - 10https://gerrit.wikimedia.org/r/1260771 (owner: 10CDanis) [12:18:58] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [12:19:20] (03CR) 10CI reject: [V:04-1] dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis) [12:23:30] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:25:51] (03CR) 10Effie Mouzeli: mcrouter: ease testing new cli parameters (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey) [12:26:50] (03PS2) 10Btullis: dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) [12:29:03] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis) [12:29:16] (03CR) 10Btullis: [V:03+2 C:03+2] dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis) [12:30:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:50] (03Merged) 10jenkins-bot: dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis) [12:36:56] (03CR) 10Volans: "Nice to add a new cookbook! 
Did a quick pass, nothing major, just some questions/suggestions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [12:37:45] (03PS2) 10Btullis: Update dse-k8s-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) [12:38:04] (03CR) 10Andrew Bogott: [C:03+2] toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [12:38:08] (03CR) 10Andrew Bogott: [C:03+2] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [12:39:06] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:39:32] (03PS2) 10Anzx: cswiki: lift IP cap for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261420 (https://phabricator.wikimedia.org/T421305) [12:39:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:39:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261420 (https://phabricator.wikimedia.org/T421305) (owner: 10Anzx) [12:40:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11753417 (10Jclark-ctr) Unable to wrap my head around the mistake this early today. Likely a typo in the wikikubes... 
[12:42:32] (03PS6) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:43:54] !log puppet reenabled on drmrs, CIDERGRINDER deployed [12:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:58] (03Merged) 10jenkins-bot: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [12:45:15] (03Merged) 10jenkins-bot: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [12:45:55] (03CR) 10Brouberol: [C:03+2] Update dse-k8s-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [12:50:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:51:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:52:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:54:04] (03PS1) 10Effie Mouzeli: mw-parsoid: remove DNS of mw-parsoid LVS service 1 [dns] - 10https://gerrit.wikimedia.org/r/1261428 (https://phabricator.wikimedia.org/T420468) [12:57:25] (03CR) 10Kamila Součková: "Thanks @tstarling@wikimedia.org, that's what I thought, but I feel much better about this now :D Much appreciated!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [12:57:54] (03PS1) 10Effie Mouzeli: mw-parsoid: Remove probes 2 [puppet] - 10https://gerrit.wikimedia.org/r/1261430 (https://phabricator.wikimedia.org/T420468) [13:00:05] Urbanecm and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1300) [13:00:05] Raine, Sergi0, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:23] o/ [13:00:55] o/ [13:01:14] o/ [13:01:24] (03PS1) 10Effie Mouzeli: mw-parsoid: switch to service_setup 3 [puppet] - 10https://gerrit.wikimedia.org/r/1261433 (https://phabricator.wikimedia.org/T420468) [13:01:48] I can self-deploy mine [13:02:33] need someone to deploy mine [13:03:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:03:15] anzx: I can push the buttons, I just don't know how to test it :D [13:03:58] testing not needed, but maintenance script needs to be run after deploying [13:04:03] ack, I can do that [13:04:13] then I'll start with yours? [13:04:18] ok [13:05:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261420 (https://phabricator.wikimedia.org/T421305) (owner: 10Anzx) [13:05:18] ok with me, I can self-deploy mine after the rest [13:05:19] (03PS1) 10Btullis: Revert "Temporarily suspend the flink applications running in dse-k8s-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261436 [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:33] thanks sergi0! 
[13:06:04] (03Merged) 10jenkins-bot: cswiki: lift IP cap for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261420 (https://phabricator.wikimedia.org/T421305) (owner: 10Anzx) [13:06:08] (03CR) 10Brouberol: [C:03+1] Revert "Temporarily suspend the flink applications running in dse-k8s-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261436 (owner: 10Btullis) [13:06:26] !log kamila@deploy2002 Started scap sync-world: Backport for [[gerrit:1261420|cswiki: lift IP cap for editathon (T421305)]] [13:06:31] T421305: Lift IP cap on 2026-03-27 for Senior Citizen Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T421305 [13:06:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:06:59] (03PS1) 10Btullis: Revert "Temporarily disable the deployment of mediawiki to dse-k8s-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1261438 [13:08:04] (03CR) 10Btullis: [C:03+2] Revert "Temporarily suspend the flink applications running in dse-k8s-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261436 (owner: 10Btullis) [13:08:29] (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable the deployment of mediawiki to dse-k8s-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1261438 (owner: 10Btullis) [13:08:35] !log kamila@deploy2002 kamila, anzx: Backport for [[gerrit:1261420|cswiki: lift IP cap for editathon (T421305)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:08:49] !log deploying new grants for new ms-backup hosts and removing old ones T420464 [13:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:55] T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12] - https://phabricator.wikimedia.org/T420464 [13:09:20] !log kamila@deploy2002 kamila, anzx: Continuing with sync [13:10:13] (03Merged) 10jenkins-bot: Revert "Temporarily suspend the flink applications running in dse-k8s-eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261436 (owner: 10Btullis) [13:11:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:01] !log btullis@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad [13:13:49] !log kamila@deploy2002 Finished scap sync-world: Backport for [[gerrit:1261420|cswiki: lift IP cap for editathon (T421305)]] (duration: 07m 22s) [13:13:54] T421305: Lift IP cap on 2026-03-27 for Senior Citizen Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T421305 [13:14:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:14:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:14:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:14:45] anzx: maintenance script finished [13:14:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:14:51] 
!log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [13:14:56] Raine: thanks for deploying [13:14:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [13:15:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:15:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [13:15:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:15:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:15:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:15:40] anzx: thanks to you too :-) [13:15:42] (prolly should have bundled that with mine, but I'm chicken today :D) [13:15:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1202: Pool db1202.eqiad.wmnet in after cloning [13:16:49] (03Merged) 10jenkins-bot: Temporarily add shellbox-icu to $wgShellboxUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [13:17:11] !log kamila@deploy2002 Started scap sync-world: Backport for [[gerrit:1256384|Temporarily add shellbox-icu to $wgShellboxUrls (T419049 T419242 T419274)]] [13:17:19] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [13:17:19] T419242: Migrate collation data to ICU 72 - https://phabricator.wikimedia.org/T419242 [13:17:20] T419274: ICU 72 upgrade: enable remote ICU collation writes - 
https://phabricator.wikimedia.org/T419274 [13:19:12] !log kamila@deploy2002 kamila: Backport for [[gerrit:1256384|Temporarily add shellbox-icu to $wgShellboxUrls (T419049 T419242 T419274)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:19:33] (03CR) 10Jcrespo: [C:03+2] mariadb: Update grants for new hosts ms-backup[12]00[34], which replace [12] [puppet] - 10https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [13:20:00] !log kamila@deploy2002 kamila: Continuing with sync [13:20:42] (03CR) 10Jcrespo: [C:03+2] "Deployed now on production. Not deployed/removed on x1/x3." [puppet] - 10https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [13:24:27] !log kamila@deploy2002 Finished scap sync-world: Backport for [[gerrit:1256384|Temporarily add shellbox-icu to $wgShellboxUrls (T419049 T419242 T419274)]] (duration: 07m 16s) [13:24:35] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [13:24:35] T419242: Migrate collation data to ICU 72 - https://phabricator.wikimedia.org/T419242 [13:24:35] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [13:24:49] sergi0: your turn [13:26:45] !log jforrester@deploy2002 Started deploy [integration/docroot@f021d3f]: Ia936ecd68e675cff2925dba933e3b67b9bad4cd6 [13:26:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11753681 (10Jclark-ctr) I also see a typo on wikikube-worker1371 Mac address in Netbox 90:5A:08:7B:1F:6D (netbox)... [13:26:57] !log jforrester@deploy2002 Finished deploy [integration/docroot@f021d3f]: Ia936ecd68e675cff2925dba933e3b67b9bad4cd6 (duration: 00m 11s) [13:27:26] @Raine ty! 
[13:29:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259132 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno) [13:29:29] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:30:11] (03Merged) 10jenkins-bot: GrowthExperiments: scale edit and thanks query limit to more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259132 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno) [13:30:29] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1259132|GrowthExperiments: scale edit and thanks query limit to more wikis (T341599)]] [13:30:35] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:32:31] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1259132|GrowthExperiments: scale edit and thanks query limit to more wikis (T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:33:35] (03CR) 10Jcrespo: [C:03+2] mediabackup: Add resharding script using rclone [puppet] - 10https://gerrit.wikimedia.org/r/1261387 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [13:34:45] (03PS7) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [13:34:47] (03PS1) 10Jcrespo: mariadb: Remove application grants for both backup1-* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1261447 (https://phabricator.wikimedia.org/T420464) [13:35:22] !log sgimeno@deploy2002 sgimeno: Continuing with sync [13:36:26] (03PS2) 10Jcrespo: mariadb: Remove ms-backup[12]00[12] app grants for both backup1-* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1261447 (https://phabricator.wikimedia.org/T420464) [13:36:53] (03CR) 10Eevans: services: add linked-artifacts service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:38:32] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:38:56] (03CR) 10Jcrespo: "permission to deploy this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1261447 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [13:39:47] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259132|GrowthExperiments: scale edit and thanks query limit to more wikis (T341599)]] (duration: 09m 17s) [13:39:52] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:40:34] (03PS2) 10Elukey: mcrouter: ease testing new cli parameters [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) [13:40:48] !log UTC afternoon backport window done [13:40:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11753761 (10Jclark-ctr) Thanks for checking @Volans I still do not see any issues with backup1012 only issues when... [13:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:53] (03CR) 10Elukey: "Should be done now :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey) [13:41:38] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11753765 (10ABran-WMF) 05Stalled→03Open [13:43:08] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11753774 (10jcrespo) Any update? Even a "No work done, I plan to work on this next X" would be useful. 
[13:45:45] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [13:46:28] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [13:50:19] (03CR) 10Jforrester: "We can do even further bumps, but we should not do this in staging, of which we're already taking too much. Will amend." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [13:50:20] (03PS1) 10Elukey: services: re-add maps1012 in service for Kartotherian/Tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261452 (https://phabricator.wikimedia.org/T421350) [13:51:11] (03PS1) 10Jcrespo: mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) [13:51:38] (03CR) 10Federico Ceratto: [C:03+1] "I checked the ipaddrs on netbox and they match ms-backup[12]00[12]" [puppet] - 10https://gerrit.wikimedia.org/r/1261447 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [13:52:28] (03PS2) 10Jcrespo: mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) [13:52:44] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [13:53:30] (03CR) 10Jcrespo: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1261447 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [13:53:59] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1055.eqiad.wmnet [13:54:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1055.eqiad.wmnet [13:55:05] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1097.eqiad.wmnet [13:55:24] (03PS2) 10Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) [13:55:24] (03PS2) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [13:55:24] (03PS1) 10Jforrester: wikifunctions: Slim down staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 [13:55:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1097.eqiad.wmnet [13:55:57] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1070.eqiad.wmnet [13:56:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1070.eqiad.wmnet [13:57:06] (03PS2) 10Jforrester: wikifunctions: Slim down staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 [13:57:06] (03PS3) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [13:57:06] (03PS3) 10Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) [13:57:07] (03PS1) 10Jforrester: wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) [13:57:37] !log dropping ms-backup[12]00[12] grants from backup1-* dbs T420464 [13:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:42] T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12] - https://phabricator.wikimedia.org/T420464 [13:57:46] (03CR) 10CI reject: [V:04-1] wikifunctions: Slim down staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [13:59:56] (03CR) 10Ssingh: "I have nothing to add to the network side of the discussion since you both are the experts, but I just wanted to check the path forward on" [homer/public] - 10https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: 10Cathal Mooney) [14:00:24] (03CR) 10Jcrespo: [C:03+2] "old grants dropped" [puppet] - 10https://gerrit.wikimedia.org/r/1261447 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [14:01:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1202: Pool db1202.eqiad.wmnet in after cloning [14:04:42] (03PS3) 10Jcrespo: mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) [14:04:48] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [14:10:38] (03PS1) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [14:12:04] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:14:06] (03PS2) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [14:15:32] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261461 [14:15:51] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261461 (owner: 10PipelineBot) [14:15:51] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:15:54] (03CR) 10Jgiannelos: [V:03+2 C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261461 (owner: 10PipelineBot) [14:17:22] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:17:30] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:17:34] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:17:47] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:18:16] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:18:20] (03PS3) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [14:18:33] (03PS1) 10Elukey: sre.hosts.provision: check presence of BMC/BIOS versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1261463 (https://phabricator.wikimedia.org/T418929) [14:18:37] (03PS1) 10Blake: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1261464 
(https://phabricator.wikimedia.org/T413974) [14:18:39] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:19:19] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:19:32] we intend to switch the deployment server over to eqiad in about 40m [14:19:47] (03PS4) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [14:20:02] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:20:07] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:20:56] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:21:12] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:21:13] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [14:21:25] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:21:56] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [14:22:18] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [14:22:42] (03PS5) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [14:23:32] (03PS1) 10Blake: hieradata: update deployment_server to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1261465 (https://phabricator.wikimedia.org/T413974) [14:23:51] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [14:24:10] (03CR) 
10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:24:29] elukey@cumin1003 provision (PID 2414545) is awaiting input [14:24:36] (03CR) 10JHathaway: [C:03+1] sre.hosts.provision: check presence of BMC/BIOS versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1261463 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [14:27:48] (03PS1) 10Volans: .wmfconfig: build for bookworm and trixie [software/cumin] - 10https://gerrit.wikimedia.org/r/1261466 [14:28:43] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:29:07] (03CR) 10Elukey: [C:03+1] .wmfconfig: build for bookworm and trixie [software/cumin] - 10https://gerrit.wikimedia.org/r/1261466 (owner: 10Volans) [14:29:44] (03CR) 10Volans: [C:03+2] .wmfconfig: build for bookworm and trixie [software/cumin] - 10https://gerrit.wikimedia.org/r/1261466 (owner: 10Volans) [14:30:01] (03PS2) 10Elukey: sre.hosts.provision: check presence of BMC/BIOS versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1261463 (https://phabricator.wikimedia.org/T418929) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1430) [14:30:40] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:31:29] (03PS1) 10Brouberol: dse-k8s-eqiad/istio: mirror configuration from the wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261467 (https://phabricator.wikimedia.org/T414484) [14:33:31] (03CR) 10Elukey: wikifunctions: Slim down staging resources (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:34:32] (03CR) 10Btullis: [C:03+1] 
"Looks good to me. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261467 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [14:39:17] (03PS3) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 [14:39:17] (03CR) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:39:17] (03PS4) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [14:39:21] (03PS4) 10Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) [14:39:25] (03PS2) 10Jforrester: wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) [14:39:40] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:39:55] (03PS1) 10Kamila Součková: Enable $wgTempCategoryCollations for testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) [14:41:22] (03CR) 10CI reject: [V:04-1] Enable $wgTempCategoryCollations for testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [14:45:03] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad/istio: mirror configuration from the wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261467 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [14:47:35] (03Merged) 10jenkins-bot: .wmfconfig: build for bookworm and trixie [software/cumin] - 10https://gerrit.wikimedia.org/r/1261466 (owner: 10Volans) [14:49:50] (03PS2) 10Kamila Součková: Enable $wgTempCategoryCollations for testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) [14:50:35] (03CR) 10Jforrester: wikifunctions: Slim down staging resources, and fix main staging config (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:50:41] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11754164 (10Jhancock.wm) Hey. The issue turned out to be more involved than I thought. I need to replace the power distribution board in... [14:52:30] (03CR) 10Dzahn: "yea, but is "we need other things too" a reason to not do a thing we definitely need?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [14:53:33] (03CR) 10Btullis: [C:03+1] trafficserver: enabling access to airflow-fr-tech.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1260762 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [14:53:56] (03CR) 10Brouberol: [C:03+2] trafficserver: enabling access to airflow-fr-tech.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1260762 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [14:55:08] (03CR) 10Federico Ceratto: "Just for tracking - this is partially related to T420475" [cookbooks] - 10https://gerrit.wikimedia.org/r/1240635 (owner: 10Federico Ceratto) [14:55:23] (03PS2) 10Blake: wmnet: update deployment CNAME record to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1261464 (https://phabricator.wikimedia.org/T413974) [14:56:02] (03CR) 10Jasmine: [C:03+1] wmnet: update deployment CNAME record to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1261464 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [14:56:20] (03CR) 10Jasmine: [C:03+1] hieradata: update deployment_server to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1261465 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [14:56:58] (03PS10) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [14:57:10] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) [15:00:04] Deploy window Deployment server switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1500) [15:00:15] (03CR) 10JavierMonton: stream: mw-page-html-content-change-enrich-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261475 
(https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [15:00:57] (03CR) 10Dillon: [C:03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [15:01:37] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1253: Pool db1253.eqiad.wmnet in after cloning [15:01:38] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11754219 (10jcrespo) > Might be next week before i can finish that out. I'll let you know Thank you, that's all I needed to know. Take your... [15:01:40] (03PS1) 10Dzahn: zuul: add tls_truststore value on executor role [puppet] - 10https://gerrit.wikimedia.org/r/1261476 [15:02:16] (03CR) 10Ssingh: [C:03+1] wmnet: update deployment CNAME record to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1261464 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [15:03:04] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:03:25] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:04:45] (03CR) 10Dzahn: [C:03+2] zuul: add tls_truststore value on executor role [puppet] - 10https://gerrit.wikimedia.org/r/1261476 (owner: 10Dzahn) [15:07:47] 10SRE-Access-Requests, 06Wikimedia Enterprise: Requesting Access to Data Engineering Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11754265 (10JArguello-WMF) [15:08:05] (03CR) 10Dillon: [C:03+1] "LGTM, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: 10Scardenasmolinar) [15:08:51] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: check presence of BMC/BIOS versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1261463 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [15:10:08] (03PS1) 10Jforrester: Replace WANObjectCache with new MemcachedWrapper concept [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261477 (https://phabricator.wikimedia.org/T419666) [15:16:45] (03PS2) 10Jforrester: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) [15:19:46] !log blake@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet,releases1003.eqiad.wmnet with reason: Deployment server switchover [15:20:48] (03CR) 10Blake: [C:03+2] wmnet: update deployment CNAME record to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1261464 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [15:21:27] !log blake@dns1004 START - running authdns-update [15:21:58] (03PS15) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) [15:22:07] !log updating dns for the deployment host switchover [15:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:13] (03CR) 10Elukey: [C:03+2] services: re-add maps1012 in service for Kartotherian/Tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261452 (https://phabricator.wikimedia.org/T421350) (owner: 10Elukey) [15:23:12] !log blake@dns1004 END - running authdns-update [15:23:38] 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - 
https://phabricator.wikimedia.org/T419298#11754384 (10RobH) Who is best to address and fix this bug? [15:24:00] (03CR) 10Blake: [C:03+2] hieradata: update deployment_server to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1261465 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [15:29:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11754466 (10Scott_French) 05In progress→03Resolved a:03Scott_French @Daria-WMDE - Your access sho... [15:30:03] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11754469 (10Scott_French) 05In progress→03Resolved a:03Scott_French @Alice.moutinho - Your access should be live now. Please reo... [15:30:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11754472 (10Scott_French) 05In progress→03Resolved a:03Scott_French @kera_wmde - Your access should be live now. Please reope... 
[15:30:34] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [15:30:59] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [15:32:51] (03PS1) 10Btullis: Add an LDAP group to the list considered during offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1261481 (https://phabricator.wikimedia.org/T417213) [15:33:54] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [15:34:44] (03CR) 10Vgutierrez: "check version of this cookbook implemented in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1259927" [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [15:40:43] (03CR) 10Vgutierrez: "sample dry-run output against ulsfo nodes can be seen here https://phabricator.wikimedia.org/P89950" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [15:41:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11754548 (10katiamusiolekwmde) >>! In T420459#11733750, @Scott_French wrote: > @katiamusiolekwmde - A couple of items: > 1. I don't see a Developer (LDAP) account associa... 
[15:42:35] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11754552 (10katiamusiolekwmde) just created a developer account and yes, I am applying for level 1 access [15:42:40] (03PS1) 10Fabfur: cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) [15:43:01] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul1002.eqiad.wmnet with reason: T421330 [15:43:06] T421330: SystemdUnitFailed - zuul-scheduler - https://phabricator.wikimedia.org/T421330 [15:43:28] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul2002.codfw.wmnet with reason: T421330 [15:43:39] (03PS1) 10JMeybohm: prometheus::k8s: Ingest envoy cluster_update metrics [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) [15:44:04] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [15:45:12] jouncebot: nowandnext [15:45:12] For the next 0 hour(s) and 14 minute(s): Deployment server switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1500) [15:45:12] In 0 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1600) [15:45:24] (03CR) 10Herron: "Am I understanding right that we would need to run the cookbook whenever a new prom instance is added in order for the compactors to pick " [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [15:45:59] (03CR) 10Btullis: [C:03+2] Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [15:46:03] 
!log blake@deploy1003 Started scap sync-world: Test deployment to validate deployment server switchover - T413974 [15:46:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [15:46:07] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974 [15:47:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1253: Pool db1253.eqiad.wmnet in after cloning [15:47:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [15:47:10] (03CR) 10Brouberol: [C:03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [15:47:15] (03PS2) 10JMeybohm: prometheus::k8s: Ingest envoy cluster_update metrics [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) [15:47:50] (03Merged) 10jenkins-bot: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [15:50:40] (03PS2) 10Fabfur: cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) [15:50:51] hashar: fyi we do need the puppet window today, even though we don't usually :) [15:50:55] andre: I have marked the train resolved! Thank you for having checked-in each morning! :] [15:51:06] ahaha, congrats! 
[15:51:20] (03PS3) 10Fabfur: cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) [15:51:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [15:51:40] rzl: yes yes no worries. I was checking whether there was a slot for some mediawiki hotfix and given there is the switchover going it will land in next week train (it was an unimportant change) [15:51:44] rzl: thanks for the notice! [15:51:51] cool :) [15:58:22] (03CR) 10BCornwall: prometheus::ops: Monitor IPIP realservers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [15:58:22] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on cp6001 and cp6009 [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) [15:58:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [15:59:15] (03PS47) 10CDobbins: (traffic): WIP for depooled cp hosts alert [alerts] - 10https://gerrit.wikimedia.org/r/1217262 [15:59:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11754661 (10ABran-WMF) I've been able to test my change on [[ https://wikitech.wikimedia.org/wiki/Puppet/Pontoon | Pontoon ]]:... [16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1600). [16:00:05] James_F and genocation: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] Whee. [16:00:10] o/ [16:00:23] bjensen: understand you're running late, no worries, just let us know [16:00:50] rzl: will do, still making progress :) [16:00:59] rad [16:01:00] Awesome. [16:02:02] (03CR) 10BCornwall: [C:03+1] cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [16:02:04] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [16:02:17] (03CR) 10Ottomata: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261377 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [16:02:23] meantime, for James_F and genocation's awareness, the process will be +2 and merge, puppet-merge, run puppet on the deploy host, and then scap (with the "don't build, just helmfile deploy" flags) to pick up the changes [16:02:30] * James_F nods. [16:02:45] I'll run all those and let you know how it's going, but when we pause at mw-debug, the new httpbb tests will run and I'll ask you to manually test too [16:02:52] got it! [16:03:30] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11754678 (10BTullis) >>! In T419820#11730234, @Scott_French wrote: > @BTullis - I see this is assigned to you. Do you need any assistanc... [16:05:36] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [16:07:42] (03CR) 10Daniel Kinzler: [C:04-1] "CR-1 because we should introduce all four planned limits." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [16:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:24] (03CR) 10Herron: "niiice! added a couple comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:14:15] (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway: add values for auth-newuser rate limiting class for feature patch (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [16:16:06] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [16:17:12] !log blake@deploy1003 Finished scap sync-world: Test deployment to validate deployment server switchover - T413974 (duration: 31m 09s) [16:17:16] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974 [16:17:40] the deployment server switchover is complete, thanks! [16:17:46] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [16:17:47] going! [16:17:55] (03CR) 10RLazarus: [C:03+2] Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza) [16:18:23] yep I am the first one using the wrong deployment host, sorry : [16:18:24] :) [16:19:19] please submit an ascii-art picture of your face to hang in shame on the new motd [16:20:26] James_F, genocation: puppet-merge is done, running puppet on deploy1003 now (this takes a bit) [16:20:30] Ack. 
[16:22:07] elukey: my email to ops could have been earlier :D [16:23:40] (03PS1) 10JHathaway: nftables: cleanup tests [puppet] - 10https://gerrit.wikimedia.org/r/1261497 [16:24:22] (03CR) 10CI reject: [V:04-1] nftables: cleanup tests [puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [16:25:02] (03CR) 10Vgutierrez: [C:03+1] cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [16:26:27] (03PS2) 10JHathaway: nftables: cleanup tests [puppet] - 10https://gerrit.wikimedia.org/r/1261497 [16:27:57] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [16:28:03] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [16:28:04] done! scapping (reviewing the helmfile diffs first) [16:28:31] (03PS4) 10Fabfur: cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) [16:28:51] (03CR) 10Fabfur: cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [16:29:13] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [16:30:02] James_f, genocation: fyi this is the generated apache config diff https://www.irccloud.com/pastebin/TNG04GYM/ [16:30:39] Ack. So much complication. 
[16:30:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:30:48] (03PS3) 10JHathaway: nftables: cleanup tests [puppet] - 10https://gerrit.wikimedia.org/r/1261497 [16:30:54] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [16:31:00] line 57 there is the one we wanted to make sure we saw [16:31:10] Yes. [16:31:14] yeah [16:31:38] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1256396 T420666 [16:31:44] T420666: Our /view/ URLs aren't working on abstract.wikipedia.org, so we're generating links for users that don't work - https://phabricator.wikimedia.org/T420666 [16:33:28] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11754828 (10BCornwall) It's been consistent behavior for some weeks now - both downtimes are removed at once after the reboot occurs. Not sure if... [16:33:30] jouncebot: nowandnext [16:33:30] For the next 0 hour(s) and 26 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1600) [16:33:30] In 0 hour(s) and 26 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1700) [16:33:30] In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1700) [16:33:37] OK, change in behaviour in mw-debug; https://abstract.wikipedia.org/view/fr/Q31 now shows the correct thing. 
[16:33:43] Dreamy_Jazz: I can let you know when I'm done :) [16:33:49] Yes, thanks [16:33:52] (want to use scap) [16:33:56] rzl: LGTM to proceed. [16:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:17] James_F: oh good, I'm still waiting for the testserver phase to finish :) [16:34:42] rzl: The benefit of the *actual* mw-debug servers we use scapping fast, as opposed to the boring ones that fill out the 12. [16:34:49] haha yeah [16:34:52] !log rzl@deploy1003 rzl: https://gerrit.wikimedia.org/r/1256396 T420666 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:35:36] sweet, httpbb is happy too (which means the new tests pass, and we also didn't break the rest of the world) [16:35:46] pfew [16:35:53] going ahead [16:35:56] !log rzl@deploy1003 rzl: Continuing with sync [16:36:35] genocation: congratulations, you edited the highest-traffic open-source apache config in the world, and lived to talk about it :) [16:37:11] \o/ [16:37:21] <3 [16:37:30] thank you for your help! [16:37:53] (03CR) 10Scott French: [C:03+1] "Thanks, Raine!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [16:37:58] this will keep wrapping up in the usual scap way, but as long as nothing goes wrong[tm] there's nothing more for you both to do, so feel free to wander off [16:38:13] or re-test without the debug extension when it finishes, if you like, obviously [16:38:21] (03PS16) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) [16:38:31] (03CR) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:38:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:38:41] Just re-tested without mw-debug and it worked. [16:39:54] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1256396 T420666 (duration: 11m 21s) [16:39:59] T420666: Our /view/ URLs aren't working on abstract.wikipedia.org, so we're generating links for users that don't work - https://phabricator.wikimedia.org/T420666 [16:40:22] yep, I like it [16:40:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:41:10] (03CR) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:41:30] ^ that's normal, just unlucky timing -- cumin2002 picked up the new httpbb tests before the apache config was actually live [16:41:33] it'll clear on its own [16:41:47] all set here I think [16:41:58] Dreamy_Jazz: 
over to you -- don't forget deploy1003 is active now :) [16:42:15] Thanks! [16:47:33] (03PS2) 10Daniel Kinzler: [WIP] rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (owner: 10Bartosz Dziewoński) [16:50:24] (03CR) 10Daniel Kinzler: [C:03+1] "This changes the behavior so known-network and known-client override any rlc claims from the JWT. That's fine as long as we don't have any" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (owner: 10Bartosz Dziewoński) [16:50:43] (03PS17) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) [16:50:58] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:53:12] (03PS2) 10Vgutierrez: haproxy: Make host header validation mandatory [puppet] - 10https://gerrit.wikimedia.org/r/1237922 [16:53:22] (03PS48) 10CDobbins: (traffic): add alert for depooled cp* hosts [alerts] - 10https://gerrit.wikimedia.org/r/1217262 [16:53:33] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237922 (owner: 10Vgutierrez) [16:54:34] (03PS44) 10Tiziano Fogli: sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) [16:55:20] (03PS4) 10Scott French: mw-*: Use envoy drain configuration everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) [16:55:50] (03CR) 10Scott French: "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:57:55] (03CR) 10Fabfur: [C:03+1] "good idea!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1237922 (owner: 10Vgutierrez) [16:58:09] (03CR) 10CI reject: [V:04-1] mw-*: Use envoy drain configuration everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:58:17] Testing... [16:59:53] (03CR) 10Vgutierrez: (traffic): add alert for depooled cp* hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1217262 (owner: 10CDobbins) [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1700) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1700) [17:00:43] (03CR) 10Vgutierrez: [C:03+2] haproxy: Make host header validation mandatory [puppet] - 10https://gerrit.wikimedia.org/r/1237922 (owner: 10Vgutierrez) [17:01:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [17:04:09] scap finished [17:04:45] (03CR) 10Tiziano Fogli: "Yes, that's right." [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [17:05:30] (03CR) 10Vgutierrez: (traffic): add alert for depooled cp* hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1217262 (owner: 10CDobbins) [17:17:28] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11754946 (10jcrespo) >>! 
In T196336#11754828, @BCornwall wrote: > It's been consistent behavior for some weeks now - both downtimes are removed at... [17:28:28] (03PS1) 10Eevans: cassandra_dev: add aqsloader grants to staging [puppet] - 10https://gerrit.wikimedia.org/r/1261504 (https://phabricator.wikimedia.org/T420008) [17:29:33] (03CR) 10Eevans: [C:03+2] cassandra_dev: add aqsloader grants to staging [puppet] - 10https://gerrit.wikimedia.org/r/1261504 (https://phabricator.wikimedia.org/T420008) (owner: 10Eevans) [17:29:57] (03CR) 10Herron: [C:03+1] "Yes lets make puppet complain loudly about this condition." [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [17:31:36] jouncebot: nowandnext [17:31:36] For the next 0 hour(s) and 28 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1700) [17:31:36] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T1700) [17:31:36] In 2 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T2000) [17:31:37] (03PS1) 10Eevans: Revert "cassandra_dev: add aqsloader grants to staging" [puppet] - 10https://gerrit.wikimedia.org/r/1261506 [17:31:52] (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:32:32] (03CR) 10Eevans: [C:03+2] Revert "cassandra_dev: add aqsloader grants to staging" [puppet] - 10https://gerrit.wikimedia.org/r/1261506 (owner: 10Eevans) [17:32:47] if no one else was about to use the MW infra window, I'd hijack it to charlie out the envoy updates in eqiad, but I'm guessing s.wfrench is going to get in there now that our meeting is done [17:34:00] o/ [17:34:19] I'm running a 
bit behind on what I had in mind, so please feel free to go ahead [17:34:46] are you sure? I'd probably take up the rest of the window, more than happy to do it this afternoon instead [17:35:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:35:36] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [17:36:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11754978 (10Aklapper) Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' ac... [17:40:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:41:05] (03PS4) 10Herron: kafkamon: rename class [puppet] - 10https://gerrit.wikimedia.org/r/1253505 (https://phabricator.wikimedia.org/T418858) [17:42:32] alright, after discussion with r.zl, I'll go ahead with my changes momentarily [17:42:38] (03CR) 10Scott French: [C:03+2] mw-*: Use envoy drain configuration everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:45:25] (03CR) 10Herron: [C:03+2] kafkamon: rename class [puppet] - 10https://gerrit.wikimedia.org/r/1253505 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [17:45:57] (03Merged) 10jenkins-bot: mw-*: Use envoy drain configuration everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:46:44] (03PS1) 10Majavah: 
P:toolforge::prometheus: Disable istio-gateway scrape for now [puppet] - 10https://gerrit.wikimedia.org/r/1261510 (https://phabricator.wikimedia.org/T421386) [17:47:43] (03PS1) 10Reedy: InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261511 (https://phabricator.wikimedia.org/T421413) [17:49:00] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Disable istio-gateway scrape for now [puppet] - 10https://gerrit.wikimedia.org/r/1261510 (https://phabricator.wikimedia.org/T421386) (owner: 10Majavah) [17:52:40] !log swfrench@deploy1003 Started scap sync-world: helmfile-only deployment to enable envoy drain on remaining services - T364245 [17:52:45] T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245 [17:55:13] !log swfrench@deploy1003 Finished scap sync-world: helmfile-only deployment to enable envoy drain on remaining services - T364245 (duration: 05m 31s) [17:56:40] rzl: all yours [17:57:31] thanks! 
[17:58:35] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/apertium: apply [17:59:10] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [17:59:15] 06SRE, 07OKR-Work, 06Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11755060 (10lerickson) [18:02:38] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [18:02:41] (03PS1) 10Arlolra: Use prod to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261515 (https://phabricator.wikimedia.org/T420299) [18:03:07] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [18:03:30] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:03:34] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:04:00] 06SRE, 07OKR-Work, 06Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11755118 (10lerickson) Update: * The table exists: wikidata.wdqs_external_queries_by_user_agent_daily aggregates the queries b... 
[18:04:22] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [18:04:49] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [18:06:14] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [18:06:39] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [18:06:45] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [18:07:18] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [18:07:24] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [18:07:49] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [18:07:57] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:08:12] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:08:18] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [18:08:39] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [18:08:45] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/echostore: apply [18:09:12] (03CR) 10BryanDavis: [C:03+1] Use prod to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261515 (https://phabricator.wikimedia.org/T420299) (owner: 10Arlolra) [18:09:25] (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) [18:09:40] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [18:09:54] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [18:10:14] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [18:11:00] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: 
apply [18:11:21] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [18:11:27] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [18:11:52] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [18:12:07] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:12:32] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:12:56] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:13:15] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:13:24] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [18:13:45] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [18:14:22] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [18:15:06] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [18:16:32] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:17:37] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:17:48] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [18:18:04] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [18:18:49] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{sessionstore1006.eqiad.wmnet} and P{P:Cassandra} [18:19:13] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [18:19:30] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [18:19:59] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [18:20:20] !log rzl@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/ipoid: apply [18:20:34] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [18:20:59] (03CR) 10BCornwall: "Looking great. Placeholders need to be handled." [alerts] - 10https://gerrit.wikimedia.org/r/1217262 (owner: 10CDobbins) [18:21:40] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [18:21:46] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [18:21:53] (03CR) 10BCornwall: [C:03+1] cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [18:22:07] FIRING: ProbeDown: Service sessionstore1006-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1006-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:36] (03PS1) 10CDanis: haproxy: CIDERGRINDER 🍎 small fixes [puppet] - 10https://gerrit.wikimedia.org/r/1261518 [18:23:04] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261518 (owner: 10CDanis) [18:23:06] (03CR) 10BCornwall: [C:03+1] DO NOT MERGE: wmnet: add ms1, ms2 and remove x2 following [0] [dns] - 10https://gerrit.wikimedia.org/r/1260132 (https://phabricator.wikimedia.org/T387332) (owner: 10Jasmine) [18:24:12] (03CR) 10BCornwall: [C:03+1] haproxy: CIDERGRINDER 🍎 small fixes [puppet] - 10https://gerrit.wikimedia.org/r/1261518 (owner: 10CDanis) [18:25:52] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{sessionstore1006.eqiad.wmnet} and P{P:Cassandra} [18:26:26] (03CR) 10CDanis: [C:03+2] haproxy: CIDERGRINDER 🍎 small fixes [puppet] - 10https://gerrit.wikimedia.org/r/1261518 (owner: 10CDanis) [18:26:31] (03PS12) 10Herron: site: opt-in 
insetup defaults by hostname prefix [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) [18:26:46] (03CR) 10CI reject: [V:04-1] site: opt-in insetup defaults by hostname prefix [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: 10Herron) [18:27:07] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:27:19] (03CR) 10Catrope: [C:03+1] config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [18:27:31] (03PS13) 10Herron: site: opt-in insetup defaults by hostname prefix [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) [18:27:32] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [18:28:00] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [18:28:27] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [18:28:40] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:30:16] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:30:41] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [18:30:44] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:30:59] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: 10JMeybohm) [18:30:59] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [18:31:02] (03CR) 10JMeybohm: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: 10JMeybohm) [18:31:15] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [18:31:20] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [18:31:48] (03PS3) 10JMeybohm: prometheus::k8s: Ingest envoy cluster_update metrics [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) [18:31:59] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: 10JMeybohm) [18:32:07] RESOLVED: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:32:24] (03CR) 10C. Scott Ananian: [C:03+1] Use prod to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261515 (https://phabricator.wikimedia.org/T420299) (owner: 10Arlolra) [18:32:39] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [18:32:41] federico3, arnoldokoth can CTT do a slightly early spiderpig deploy to mediawiki-config of [Use prod to serve maps in labs (1261515) · Gerrit Code Review](https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1261515) ? 
[18:33:07] we're not available during the deploy window in ~90min but are available now [18:33:21] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [18:33:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261515 (https://phabricator.wikimedia.org/T420299) (owner: 10Arlolra) [18:33:48] cscott: ok [18:33:53] cscott: not answering for the oncallers -- if you do, I'll pause what I'm doing, but happy do [18:33:57] *happy to [18:33:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [18:34:12] rzl: how much time do you need? [18:34:19] 10-15 or so probably [18:34:33] that's fine with me, just ping me when you're done [18:34:37] will do, thanks :) [18:34:40] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [18:34:43] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [18:35:03] (working alphabetically from apertium to zotero, and as you can see I'm up to R so most of the way there) [18:35:12] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [18:35:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:35:35] rzl: sounds Riveting [18:35:37] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [18:35:45] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [18:36:00] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [18:36:06] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [18:36:44] !log cdobbins@cumin2002 
START - Cookbook sre.loadbalancer.admin rebooting P{lvs4009.ulsfo.wmnet} and A:liberica [18:36:54] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [18:37:06] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [18:37:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:37:37] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [18:37:45] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [18:38:01] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [18:38:07] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:38:23] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:38:31] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [18:38:53] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [18:38:58] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [18:39:29] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [18:39:38] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/termbox: apply [18:40:11] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs4009.ulsfo.wmnet} and A:liberica [18:40:21] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [18:40:34] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:41:09] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: 
apply [18:41:50] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [18:42:09] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [18:42:19] 10ops-eqiad, 06DC-Ops: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421 (10BCornwall) 03NEW [18:42:20] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [18:42:30] 10ops-eqiad, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11755335 (10BCornwall) [18:42:42] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [18:43:00] cscott: all yours [18:43:05] 10ops-eqiad, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11755340 (10BCornwall) [18:43:06] rzl: thanks! [18:43:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11755341 (10BCornwall) [18:43:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261515 (https://phabricator.wikimedia.org/T420299) (owner: 10Arlolra) [18:44:04] (03PS2) 10CDanis: Revert "fix NIC saturation exporter to be jessie-compatible 😖" [puppet] - 10https://gerrit.wikimedia.org/r/691216 (https://phabricator.wikimedia.org/T224454) [18:44:43] (03Merged) 10jenkins-bot: Use prod to serve maps in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261515 (https://phabricator.wikimedia.org/T420299) (owner: 10Arlolra) [18:47:08] (03CR) 10CDanis: [C:03+2] Revert "fix NIC saturation exporter to be jessie-compatible 😖" [puppet] - 10https://gerrit.wikimedia.org/r/691216 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [18:47:24] (03CR) 10CDanis: [C:03+2] Revert "fix NIC saturation exporter to be jessie-compatible 😖" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/691216 
(https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [18:47:28] (03PS2) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) [18:48:22] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs4008.ulsfo.wmnet} and A:liberica [18:50:26] (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261525 (https://phabricator.wikimedia.org/T421366) [18:51:31] (03Abandoned) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261525 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [18:51:51] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs4008.ulsfo.wmnet} and A:liberica [18:53:23] 10ops-eqiad, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11755363 (10BCornwall) [18:53:25] FIRING: SystemdUnitFailed: nic-saturation-exporter.service on ml-serve2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:53:33] (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261526 (https://phabricator.wikimedia.org/T421366) [18:53:49] 10ops-eqiad, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11755365 (10BCornwall) [18:54:16] rzl, federico3, arnoldokoth: i'm done. that was faster than i expected since it was a labs-only change. 
[18:54:25] FIRING: [3x] SystemdUnitFailed: nic-saturation-exporter.service on ganeti2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:26] thanks cscott [18:54:47] cscott: ack. ty. [18:58:15] (03CR) 10BCornwall: [C:03+1] mw-parsoid: remove DNS of mw-parsoid LVS service 1 [dns] - 10https://gerrit.wikimedia.org/r/1261428 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [19:00:53] (03CR) 10Ssingh: "please ping @bcornwall@wikimedia.org for the service removal if Traffic help is needed. https://wikitech.wikimedia.org/wiki/LVS#Remove_a_l" [dns] - 10https://gerrit.wikimedia.org/r/1261428 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [19:02:55] beta cluster seems to be down? [19:03:06] i'm hoping that wasn't my fault [19:03:06] (03PS3) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) [19:03:56] (03CR) 10CI reject: [V:04-1] config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [19:04:25] FIRING: [3x] SystemdUnitFailed: nic-saturation-exporter.service on ganeti2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:48] cscott: beta has been being hugged to death by scraper bots for the last 3-4 days, so likely nothing to do with the config change [19:04:58] (03CR) 10Andrea Denisse: [C:03+2] grafana: Add a SameSite attribute to cookies [puppet] - 10https://gerrit.wikimedia.org/r/1259382 (https://phabricator.wikimedia.org/T402844) (owner: 10Andrea Denisse) [19:05:07] bd808: 
good (but bad) [19:05:18] what in the world is worth scraping on beta [19:06:11] LLM tarainers give not shits about what they are killing, just that they get tokens to add to the model [19:06:15] *trainers [19:06:27] https://phabricator.wikimedia.org/T420833#11754167 [19:06:56] (03PS4) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) [19:07:13] (03PS5) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) [19:08:44] bd808: seems like it might be worth getting the old WMF VPN working again, just so that we can limit traffic from non-employees to beta. [19:09:31] cscott: that would be... pretty horrible for the technical volunteers who actually use that environment [19:09:58] most of the folks who bother to report issues with Beta are not staff [19:10:01] yeah, i guess i'm suggesting we give volunteers access to the VPN as well. [19:10:31] maybe not "direct access to WMF internal network" VPN, just a way of identifying an IP block of known-good users. [19:10:37] Beta needs requestctl and related modern blocking tools [19:10:57] it's in Traffic's list of works to bring beta to parity with the CDN. we just have not been able to prioritize it properly with the other work. [19:11:00] that's on me. [19:11:06] or yes, we could paywall it all off somehow [19:12:06] sukhe: #hugsops this is not your fault, but thank you for being sad that you can't help yet [19:14:01] 06SRE, 06ServiceOps new, 07Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11755505 (10matmarex) The `badtoken` error rate has been fluctuating over the last 3 days, it seems to roughly match the daily p... 
[19:14:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11755510 (10Jgreen) a:05Jgreen→03VRiley-WMF @VRiley-WMF I'm unable to log in with either the usual temporary password or the standard fundraising one. Can you please check it... [19:23:25] RESOLVED: SystemdUnitFailed: nic-saturation-exporter.service on ml-serve2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:25] RESOLVED: [2x] SystemdUnitFailed: nic-saturation-exporter.service on ganeti4008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:58] (03PS1) 10Bartosz Dziewoński: Wrap 'centralauthtoken' in a JWT [extensions/CentralAuth] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261545 (https://phabricator.wikimedia.org/T420280) [19:29:28] (03CR) 10Stoyofuku-wmf: "Agree with above - do you mind adding `Depends-On: I6436842f1ca4658b4d7c44f166219f3a8847cdc4` to the commit message?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) (owner: 10LorenMora) [19:29:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/CentralAuth] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261545 (https://phabricator.wikimedia.org/T420280) (owner: 10Bartosz Dziewoński) [19:33:42] (03CR) 10Scott French: [C:03+1] "Thanks, Jasmine! This looks right per the current state of eqiad's `dbconfig`. Maybe add someone from DBA as well so it's on their radar?" 
[dns] - 10https://gerrit.wikimedia.org/r/1260132 (https://phabricator.wikimedia.org/T387332) (owner: 10Jasmine) [19:43:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:44:34] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003.drmrs.wmnet} and A:liberica [19:47:50] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6003.drmrs.wmnet} and A:liberica [19:49:26] (03PS2) 10LorenMora: Transition reading list experiment to instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) [19:49:53] (03CR) 10LorenMora: "done, thank's y'all!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) (owner: 10LorenMora) [19:51:55] (03CR) 10Jasmine: "Thanks! Adding Manuel" [dns] - 10https://gerrit.wikimedia.org/r/1260132 (https://phabricator.wikimedia.org/T387332) (owner: 10Jasmine) [19:58:09] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11755855 (10Jhancock.wm) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T2000). [20:00:05] SCardenasM, RoanKattouw, katherine_g, Raine, cscott, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:41] o/ [20:00:48] hi. 
i'd appreciate if one of you could ship my patch as well while doing yours :) [20:00:50] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{aqs1015.eqiad.wmnet} and P{P:Cassandra} [20:00:51] quiet oncall shift [20:01:22] o/ just removed mine from the list so can continue without me [20:02:09] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:02:51] MatmaRex: I have a config patch, I suppose I can bundle your change with mine [20:03:04] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11755868 (10Jgreen) a:05Jgreen→03Jclark-ctr @Jclark-ctr I finally had a chance to look into the frmx1002/frdata1003 issue. I think the servers are in the correct rack, but... [20:03:18] (03CR) 10Catrope: [C:03+1] config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261526 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:03:34] yeah, that should be fine [20:04:01] (03CR) 10Catrope: [C:03+1] "Looks good, but should be rebased onto https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1261526 (or be rebased onto master a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:04:05] I can go ahead and deploy my changes [20:04:07] FIRING: [2x] ProbeDown: Service aqs1015-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261545 (https://phabricator.wikimedia.org/T420280) (owner: 10Bartosz Dziewoński) [20:04:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [20:05:25] (03Merged) 10jenkins-bot: Enable $wgTempCategoryCollations for testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261470 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [20:05:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding restbase2039 to codfw - jhancock@cumin2002" [20:05:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding restbase2039 to codfw - jhancock@cumin2002" [20:05:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:06:13] (03Merged) 10jenkins-bot: Wrap 'centralauthtoken' in a JWT [extensions/CentralAuth] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1261545 (https://phabricator.wikimedia.org/T420280) (owner: 10Bartosz Dziewoński) [20:06:29] !log kamila@deploy1003 Started scap sync-world: Backport for [[gerrit:1261545|Wrap 'centralauthtoken' in a JWT (T420280)]], [[gerrit:1261470|Enable $wgTempCategoryCollations for testwiki. 
(T419274 T419049)]] [20:06:37] T420280: Authenticated cross-origin requests are being throttled as if unauthenticated - https://phabricator.wikimedia.org/T420280 [20:06:38] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [20:06:38] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [20:07:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host restbase2039 [20:08:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [20:08:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host restbase2039 [20:08:47] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{aqs1015.eqiad.wmnet} and P{P:Cassandra} [20:08:49] ACKNOWLEDGEMENT - MD RAID on aqs1015 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T421439 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:08:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439 (10ops-monitoring-bot) 03NEW [20:09:07] FIRING: [4x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:11:01] !log 
jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:12:46] MatmaRex: will you want to test your change on mwdebug, once we get to that? [20:13:01] yeah [20:13:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:14:07] RESOLVED: [4x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:15:05] (03PS1) 10Dzahn: zuul: mariadb+pymysql instead of mariadb+mariadbconnector for DB connection [puppet] - 10https://gerrit.wikimedia.org/r/1261567 [20:20:01] (03CR) 10Alex Paskulin: [C:03+1] InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261511 (https://phabricator.wikimedia.org/T421413) (owner: 10Reedy) [20:20:33] jouncebot: nowandnext [20:20:33] For the next 0 hour(s) and 39 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T2000) [20:20:33] In 0 hour(s) and 39 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T2100) [20:20:38] excellent [20:22:11] (03PS2) 10Dzahn: zuul: mariadb+pymysql instead of mysql+pymysql for DB connection [puppet] - 10https://gerrit.wikimedia.org/r/1261567 [20:25:05] !log kamila@deploy1003 matmarex, kamila: Backport for [[gerrit:1261545|Wrap 'centralauthtoken' in a JWT (T420280)]], [[gerrit:1261470|Enable $wgTempCategoryCollations for testwiki. (T419274 T419049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:25:14] T420280: Authenticated cross-origin requests are being throttled as if unauthenticated - https://phabricator.wikimedia.org/T420280 [20:25:14] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [20:25:15] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [20:25:35] nice. testing [20:29:59] Raine: done, looks good. sorry i took a while [20:30:07] !log kamila@deploy1003 matmarex, kamila: Continuing with sync [20:30:18] np, continuing [20:30:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:10] (03CR) 10Dduvall: [C:03+1] zuul: mariadb+pymysql instead of mysql+pymysql for DB connection [puppet] - 10https://gerrit.wikimedia.org/r/1261567 (owner: 10Dzahn) [20:36:04] (03PS1) 10Andrew Bogott: Trove guest-agent: update postgresql and mariadb backup versions [puppet] - 10https://gerrit.wikimedia.org/r/1261579 (https://phabricator.wikimedia.org/T420737) [20:37:59] MatmaRex: oh, you are done... I was curious how you were going to test it! Nice to hear that it worked :) [20:39:21] duesen: just did some API requests with Special:ApiSandbox, and with mw.ForeignApi [20:42:36] Nice! So we can deploy the relevant patch for the gateway on monday. [20:42:48] 10ops-eqiad, 06DC-Ops: Discrepancy for wikikube-worker[1360-1372] - https://phabricator.wikimedia.org/T421442 (10wiki_willy) 03NEW [20:43:37] (03CR) 10Stoyofuku-wmf: "Thank _you_!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) (owner: 10LorenMora) [20:44:01] !log kamila@deploy1003 Finished scap sync-world: Backport for [[gerrit:1261545|Wrap 'centralauthtoken' in a JWT (T420280)]], [[gerrit:1261470|Enable $wgTempCategoryCollations for testwiki. (T419274 T419049)]] (duration: 37m 32s) [20:44:09] T420280: Authenticated cross-origin requests are being throttled as if unauthenticated - https://phabricator.wikimedia.org/T420280 [20:44:10] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [20:44:10] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [20:44:13] well that took a while :D [20:44:25] Can I go next? [20:44:50] go for it SCardenasM [20:44:53] Thanks! [20:45:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by suecarmol@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: 10Scardenasmolinar) [20:45:27] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1261545 had localisation changes... 
so yeah, would've taken a while :( [20:46:34] (03Merged) 10jenkins-bot: PersonalDashboard: Add config for Active Discussions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: 10Scardenasmolinar) [20:46:39] (03PS45) 10Tiziano Fogli: sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) [20:46:48] !log suecarmol@deploy1003 Started scap sync-world: Backport for [[gerrit:1256498|PersonalDashboard: Add config for Active Discussions (T420785)]] [20:46:53] T420785: Add Personal Dashboard Active Discussions configuration for pilot wikis - https://phabricator.wikimedia.org/T420785 [20:50:57] !log suecarmol@deploy1003 suecarmol: Backport for [[gerrit:1256498|PersonalDashboard: Add config for Active Discussions (T420785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:51:25] Testing... [20:54:02] !log suecarmol@deploy1003 suecarmol: Continuing with sync [20:54:17] Continuing with sync [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260326T2100) [21:00:41] !log suecarmol@deploy1003 Finished scap sync-world: Backport for [[gerrit:1256498|PersonalDashboard: Add config for Active Discussions (T420785)]] (duration: 13m 53s) [21:00:47] T420785: Add Personal Dashboard Active Discussions configuration for pilot wikis - https://phabricator.wikimedia.org/T420785 [21:02:32] (03PS1) 10Eevans: cassandra_dev: temporarily install docker.io package [puppet] - 10https://gerrit.wikimedia.org/r/1261597 (https://phabricator.wikimedia.org/T421444) [21:02:54] Done! 
[21:04:50] (03CR) 10Eevans: [C:03+2] cassandra_dev: temporarily install docker.io package [puppet] - 10https://gerrit.wikimedia.org/r/1261597 (https://phabricator.wikimedia.org/T421444) (owner: 10Eevans) [21:12:31] (03PS1) 10Eevans: cassandra_dev: docker.io package requires apparmor be installed [puppet] - 10https://gerrit.wikimedia.org/r/1261602 (https://phabricator.wikimedia.org/T421444) [21:12:36] (03CR) 10Bking: [C:03+1] query_service: add prom metrics for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:13:47] (03CR) 10Ryan Kemper: [C:03+2] query_service: add prom metrics for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:14:44] (03CR) 10Eevans: [C:03+2] cassandra_dev: docker.io package requires apparmor be installed [puppet] - 10https://gerrit.wikimedia.org/r/1261602 (https://phabricator.wikimedia.org/T421444) (owner: 10Eevans) [21:20:25] (03CR) 10Reedy: [C:03+2] Add Logstash logging for successful passwordless logins [extensions/OATHAuth] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1260834 (owner: 10Catrope) [21:20:58] (03CR) 10Reedy: [C:03+2] InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261511 (https://phabricator.wikimedia.org/T421413) (owner: 10Reedy) [21:21:54] (03Merged) 10jenkins-bot: InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261511 (https://phabricator.wikimedia.org/T421413) (owner: 10Reedy) [21:24:23] (03Merged) 10jenkins-bot: Add Logstash logging for successful passwordless logins [extensions/OATHAuth] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1260834 (owner: 10Catrope) [21:25:25] FIRING: [5x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:26:23] Oh thanks Reedy, I totally forgot I had a patch lined up for today [21:26:26] heh [21:26:48] (03PS1) 10Ryan Kemper: Revert "query_service: add prom metrics for auto-restart" [puppet] - 10https://gerrit.wikimedia.org/r/1261606 [21:26:57] (03CR) 10Ryan Kemper: [C:03+2] Revert "query_service: add prom metrics for auto-restart" [puppet] - 10https://gerrit.wikimedia.org/r/1261606 (owner: 10Ryan Kemper) [21:26:59] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] Revert "query_service: add prom metrics for auto-restart" [puppet] - 10https://gerrit.wikimedia.org/r/1261606 (owner: 10Ryan Kemper) [21:27:05] RoanKattouw: heh. I wanted to get the patch removing api portal deployed, reading don't seem to be using the window... and as the other patches ran long due to localisation updates, I thought I'd double check if anything was left :) [21:27:44] * Reedy logs into the correct deployment host [21:28:20] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1260834|Add Logstash logging for successful passwordless logins]], [[gerrit:1261511|InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis (T421413)]] [21:28:26] T421413: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis? - https://phabricator.wikimedia.org/T421413 [21:29:14] Thanks, you just saved me 96 hours of missing data [21:29:37] party time [21:30:06] !log reedy@deploy1003 catrope, reedy: Backport for [[gerrit:1260834|Add Logstash logging for successful passwordless logins]], [[gerrit:1261511|InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis (T421413)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:30:33] RoanKattouw: Do you care about testing? 
I'm happy to just deploy it through [21:31:07] !log reedy@deploy1003 catrope, reedy: Continuing with sync [21:34:12] (03PS2) 10Eevans: charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112) [21:34:12] (03PS10) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [21:35:19] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1260834|Add Logstash logging for successful passwordless logins]], [[gerrit:1261511|InitialiseSettings: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis (T421413)]] (duration: 06m 58s) [21:35:23] T421413: Remove apiportalwiki from $wmgCentralAuthAutoLoginWikis? - https://phabricator.wikimedia.org/T421413 [21:42:28] (03PS1) 10JHathaway: run_ci_locally: add nounset, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1261610 [21:43:09] (03PS3) 10Bartosz Dziewoński: rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) [21:44:51] (03PS1) 10Eevans: cassandra: add linked_artifacts table grants [puppet] - 10https://gerrit.wikimedia.org/r/1261612 (https://phabricator.wikimedia.org/T420991) [21:45:52] (03CR) 10Bartosz Dziewoński: [C:03+1] "Thanks for updating the tests. Tagged with T419796 so we don't lose this." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński) [21:45:55] RESOLVED: [5x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:41] (03PS2) 10Eevans: cassandra: add linked_artifacts table grants [puppet] - 10https://gerrit.wikimedia.org/r/1261612 (https://phabricator.wikimedia.org/T420991) [21:47:58] (03CR) 10Eevans: [C:03+2] cassandra: add linked_artifacts table grants [puppet] - 10https://gerrit.wikimedia.org/r/1261612 (https://phabricator.wikimedia.org/T420991) (owner: 10Eevans)