[00:08:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196197 [00:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196197 (owner: 10TrainBranchBot) [00:31:02] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196197 (owner: 10TrainBranchBot) [00:32:57] (03CR) 10Ottomata: [C:03+1] [mw-enrichment] Bump page change schema to 1.3.0 to pick up user_central_id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196179 (https://phabricator.wikimedia.org/T401725) (owner: 10TChin) [00:46:06] (03CR) 10CDanis: haproxy: add JA4H support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [00:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:50] (03PS1) 10HMonroy: Make tags be links to wish-index with filter applied [extensions/CommunityRequests] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196198 (https://phabricator.wikimedia.org/T406719) [01:00:57] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:13:09] (03CR) 10MusikAnimal: [C:03+2] Make tags be links to wish-index with filter applied [extensions/CommunityRequests] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196198 (https://phabricator.wikimedia.org/T406719) (owner: 10HMonroy) [01:14:03] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 06s) [01:14:45] (03Merged) 10jenkins-bot: Make tags be links to wish-index with filter applied [extensions/CommunityRequests] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196198 
(https://phabricator.wikimedia.org/T406719) (owner: 10HMonroy) [01:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:16] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1196198|Make tags be links to wish-index with filter applied (T406719)]] [01:33:20] T406719: Tags should be links - https://phabricator.wikimedia.org/T406719 [01:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:35:53] !log musikanimal@deploy2002 hmonroy, musikanimal: Backport for [[gerrit:1196198|Make tags be links to wish-index with filter applied (T406719)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[01:36:17] !log musikanimal@deploy2002 hmonroy, musikanimal: Continuing with sync [01:40:41] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196198|Make tags be links to wish-index with filter applied (T406719)]] (duration: 07m 25s) [01:40:45] T406719: Tags should be links - https://phabricator.wikimedia.org/T406719 [02:06:35] (03CR) 10Scott French: [C:03+1] haproxy: add JA4H support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [02:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:42:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:45:32] fceratto@cumin1002 clone_es (PID 3887543) is awaiting input [04:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:27:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1031 gradually with 4 steps - Pool es1031.eqiad.wmnet in after cloning [05:27:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1032 gradually with 4 steps - Pool 
es1032.eqiad.wmnet in after cloning [05:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:42] (03PS1) 10Marostegui: db2248: Make a note about 1P host [puppet] - 10https://gerrit.wikimedia.org/r/1196214 [05:32:12] (03CR) 10Marostegui: [C:03+2] db2248: Make a note about 1P host [puppet] - 10https://gerrit.wikimedia.org/r/1196214 (owner: 10Marostegui) [05:34:32] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:35:19] (03PS1) 10Marostegui: mariadb: Productionize db2248 [puppet] - 10https://gerrit.wikimedia.org/r/1196215 (https://phabricator.wikimedia.org/T406551) [05:37:31] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2248 [puppet] - 10https://gerrit.wikimedia.org/r/1196215 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [05:40:05] (03PS1) 10Marostegui: site.pp: Remove external store hosts from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196217 (https://phabricator.wikimedia.org/T406488) [05:40:47] (03CR) 10Marostegui: [C:03+2] site.pp: Remove external store hosts from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1196217 
(https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:43:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2206.codfw.wmnet onto db2248.codfw.wmnet [05:43:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2206 - Depool db2206.codfw.wmnet to then clone it to db2248.codfw.wmnet - marostegui@cumin1003 [05:43:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2206 - Depool db2206.codfw.wmnet to then clone it to db2248.codfw.wmnet - marostegui@cumin1003 [05:46:23] (03PS1) 10Marostegui: mariadb: Productionize db1261 [puppet] - 10https://gerrit.wikimedia.org/r/1196218 (https://phabricator.wikimedia.org/T406550) [05:47:12] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1261 [puppet] - 10https://gerrit.wikimedia.org/r/1196218 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [05:49:49] (03Abandoned) 10Marostegui: mariadb: Enable ssl when using profile::mariadb::client [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [05:51:43] (03PS1) 10Marostegui: instances.yaml: Add db1260 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196219 (https://phabricator.wikimedia.org/T406550) [05:52:23] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1260 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196219 (https://phabricator.wikimedia.org/T406550) (owner: 10Marostegui) [05:54:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1260 to dbctl depooled T406550', diff saved to https://phabricator.wikimedia.org/P83886 and previous config saved to /var/cache/conftool/dbconfig/20251015-055457-marostegui.json [05:55:02] T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550 [05:55:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1260.eqiad.wmnet onto db1261.eqiad.wmnet [05:59:11] (03PS1) 10Marostegui: instances.yaml: Add es1052 and es1057 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196221 
(https://phabricator.wikimedia.org/T406488) [05:59:41] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1052 and es1057 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1196221 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T0600) [06:02:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1052 and es1057 to dbctl depooled T406488', diff saved to https://phabricator.wikimedia.org/P83889 and previous config saved to /var/cache/conftool/dbconfig/20251015-060210-marostegui.json [06:02:15] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:02:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83890 and previous config saved to /var/cache/conftool/dbconfig/20251015-060234-root.json [06:02:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83891 and previous config saved to /var/cache/conftool/dbconfig/20251015-060240-root.json [06:03:19] (03PS1) 10Marostegui: es1052,es1057: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1196224 (https://phabricator.wikimedia.org/T406488) [06:03:54] (03CR) 10Marostegui: [C:03+2] es1052,es1057: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1196224 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [06:04:13] (03CR) 10Slyngshede: [C:03+2] P:idp update CAS configuration for 7.2.X [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [06:10:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11275186 (10Jclark-ctr) 
05Resolved→03Open a:05VRiley-WMF→03None [06:13:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1031 gradually with 4 steps - Pool es1031.eqiad.wmnet in after cloning [06:13:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1031.eqiad.wmnet onto es1054.eqiad.wmnet [06:13:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1032 gradually with 4 steps - Pool es1032.eqiad.wmnet in after cloning [06:13:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1032.eqiad.wmnet onto es1055.eqiad.wmnet [06:13:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11275193 (10Marostegui) >>! In T405942#11273802, @RobH wrote: > Updating https://docs.google.com/spreadsheets/d/13ow4JxrsQdz8KSsdBBNwvlrAuGKo8OHWcnR4RhXTYc... [06:17:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83894 and previous config saved to /var/cache/conftool/dbconfig/20251015-061740-root.json [06:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:17:45] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:17:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83895 and previous config saved to /var/cache/conftool/dbconfig/20251015-061746-root.json [06:19:09] (03Abandoned) 10Slyngshede: C:netbox: Allow NDA group to access Netbox. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: 10Slyngshede) [06:23:46] (03PS1) 10Slyngshede: IDP: Move production to Trixie [dns] - 10https://gerrit.wikimedia.org/r/1196226 (https://phabricator.wikimedia.org/T406455) [06:32:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83896 and previous config saved to /var/cache/conftool/dbconfig/20251015-063246-root.json [06:32:50] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:32:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83897 and previous config saved to /var/cache/conftool/dbconfig/20251015-063252-root.json [06:34:11] (03CR) 10Muehlenhoff: [C:03+1] "\o/ Login to 1005 works for me." [dns] - 10https://gerrit.wikimedia.org/r/1196226 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [06:38:21] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for jq [puppet] - 10https://gerrit.wikimedia.org/r/1196107 (owner: 10Muehlenhoff) [06:42:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:45:02] (03CR) 10Muehlenhoff: [C:03+2] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1196078 (owner: 10Muehlenhoff) [06:45:13] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] interface: only bring down existing tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1195192 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [06:45:18] !log jmm@dns1004 START - running authdns-update [06:46:12] jouncebot: nowandnext [06:46:12] For the next 
0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T0600) [06:46:12] In 0 hour(s) and 13 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T0700) [06:46:32] !log jmm@dns1004 END - running authdns-update [06:47:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83898 and previous config saved to /var/cache/conftool/dbconfig/20251015-064752-root.json [06:47:56] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:47:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83899 and previous config saved to /var/cache/conftool/dbconfig/20251015-064758-root.json [06:48:57] (03CR) 10Filippo Giunchedi: [V:03+1] "Interesting, TIL! 
Thank you for the pointer" [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [06:49:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:21] (03PS2) 10Filippo Giunchedi: interface: add pre_down_command define [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) [06:53:21] (03PS2) 10Filippo Giunchedi: interface: del route on interface down [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) [06:53:21] (03PS3) 10Filippo Giunchedi: cloudceph: handle double / single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) [06:53:28] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:59:27] :( [07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T0700). [07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:33] hallo [07:00:50] I'll start deploying my patch [07:02:15] if gerrit is online :/ [07:02:28] PROBLEM - gerrit process on gerrit1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [07:02:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83900 and previous config saved to /var/cache/conftool/dbconfig/20251015-070258-root.json [07:03:03] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:03:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83901 and previous config saved to /var/cache/conftool/dbconfig/20251015-070304-root.json [07:04:16] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:04:28] RECOVERY - gerrit process on gerrit1003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [07:04:31] FIRING: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:05:12] back for me, for now! [07:06:23] the alerts firing don't seem promising [07:06:30] hashar: are we ok to run backports? 
[07:06:33] The Gerrit slowness and brief downtime have been happening a lot lately (before the recent upgrade, too) [07:07:06] kostajh: see _security [07:07:11] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:07:14] bah [07:07:26] kostajh: gerrit has troubles right now [07:07:37] some scraper is filling the Apache workers [07:07:42] godog: thanks [07:08:24] yes thank you [07:08:28] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:16] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:09:29] I was going to say, I did a backport earlier and then this same thing happened, but it finished just fine. I also uploaded a new patch during that time without issue; it was only the UI that was slow [07:09:31] RESOLVED: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:10:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11275268 (10elukey) >>! In T406656#11273272, @Dzahn wrote: > I think if the problem statement includes "I don't have any special knowledg... 
[07:12:08] (03Abandoned) 10Elukey: Remove a deprecation warning for datetime in _menu.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) (owner: 10Elukey) [07:13:15] (03CR) 10Majavah: [C:03+1] interface: add pre_down_command define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [07:14:29] (03CR) 10Filippo Giunchedi: [C:03+2] interface: add pre_down_command define [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [07:14:33] (03PS1) 10Awight: [beta] Enable subref merge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196265 (https://phabricator.wikimedia.org/T385666) [07:16:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org [07:16:08] (03CR) 10Majavah: [C:03+1] interface: del route on interface down [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [07:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195409 (https://phabricator.wikimedia.org/T402366) (owner: 10Kosta Harlan) [07:18:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83903 and previous config saved to /var/cache/conftool/dbconfig/20251015-071803-root.json [07:18:08] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:18:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83904 and previous config saved to /var/cache/conftool/dbconfig/20251015-071810-root.json [07:18:12] (03Merged) 10jenkins-bot: hCaptcha: 
Enable on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195409 (https://phabricator.wikimedia.org/T402366) (owner: 10Kosta Harlan) [07:18:46] (03PS1) 10Arnaudb: gerrit: mod_qos tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1196327 (https://phabricator.wikimedia.org/T407312) [07:18:50] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1195409|hCaptcha: Enable on enwiki (T402366)]] [07:18:54] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:18:55] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196327 (https://phabricator.wikimedia.org/T407312) (owner: 10Arnaudb) [07:20:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org [07:21:21] (03CR) 10Filippo Giunchedi: [C:03+2] interface: del route on interface down [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [07:21:23] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1195409|hCaptcha: Enable on enwiki (T402366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:23:29] !log kharlan@deploy2002 kharlan: Continuing with sync [07:25:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:27:52] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195409|hCaptcha: Enable on enwiki (T402366)]] (duration: 09m 02s) [07:27:57] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:28:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2032 gradually with 4 steps - Pool es2032.codfw.wmnet in after cloning [07:28:28] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:33:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83906 and previous config saved to /var/cache/conftool/dbconfig/20251015-073309-root.json [07:33:14] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:33:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83907 and previous config saved to /var/cache/conftool/dbconfig/20251015-073316-root.json [07:33:28] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:34:11] (03PS1) 10Majavah: 
P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [07:34:13] (03CR) 10Arnaudb: [C:03+2] gerrit: mod_qos tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1196327 (https://phabricator.wikimedia.org/T407312) (owner: 10Arnaudb) [07:34:23] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:34:23] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:34:41] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7275/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [07:35:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:37:10] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:41:48] (03PS2) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [07:41:48] (03PS1) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 
10https://gerrit.wikimedia.org/r/1196368 [07:42:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1195625 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [07:42:35] (03CR) 10CI reject: [V:04-1] interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [07:42:55] (03CR) 10Slyngshede: [C:03+2] P:idp add the Trixie hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1195625 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [07:43:49] (03PS2) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 [07:43:49] (03PS3) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [07:44:19] (03CR) 10CI reject: [V:04-1] interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [07:44:31] (03CR) 10CI reject: [V:04-1] P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [07:46:32] (03CR) 10CI reject: [V:04-1] P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [07:46:42] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] [beta] Enable subref merge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196265 (https://phabricator.wikimedia.org/T385666) (owner: 10Awight) [07:47:56] (03PS3) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 [07:47:56] (03PS4) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [07:48:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83909 and previous config saved to 
/var/cache/conftool/dbconfig/20251015-074815-root.json [07:48:20] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:48:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83910 and previous config saved to /var/cache/conftool/dbconfig/20251015-074821-root.json [07:48:40] (03CR) 10CI reject: [V:04-1] interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [07:49:45] (03PS4) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 [07:49:45] (03PS5) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [07:50:30] (03CR) 10CI reject: [V:04-1] interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [07:50:32] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2055.codfw.wmnet'] [07:50:48] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2055.codfw.wmnet'] [07:50:56] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2054.codfw.wmnet'] [07:51:38] (03PS5) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 [07:51:38] (03PS6) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [07:53:11] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe [07:53:37] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7277/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [07:53:43] hashar: I'm planning to sneak a beta config out, if that's not too disruptive to the train? [07:54:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196265 (https://phabricator.wikimedia.org/T385666) (owner: 10Awight) [07:55:23] awight_: iirc you can just +2 it and scap in prod will happily skip deploying it [07:55:27] if it only applies to beta [07:57:06] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2054.codfw.wmnet'] [07:57:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11275363 (10cmooney) >>! In T405499#11273763, @ssingh wrote: > FWIW we have typically reimaged for this in the past. I am not suggesting, just sharin... [07:57:42] (03PS1) 10Marostegui: db1248: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1196370 [07:58:12] hashar: oho I try that now [07:58:14] (03CR) 10Marostegui: [C:03+2] db1248: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1196370 (owner: 10Marostegui) [07:58:18] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2053.codfw.wmnet'] [07:58:55] (03CR) 10Awight: [C:03+2] "Deploying to beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196265 (https://phabricator.wikimedia.org/T385666) (owner: 10Awight) [07:59:46] (03Merged) 10jenkins-bot: [beta] Enable subref merge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196265 (https://phabricator.wikimedia.org/T385666) (owner: 10Awight) [07:59:57] hashar: In the past the standard was to run scap anyway, so that the next deployer doesn't have to see it. 
But IIUC you suggest not running scap, true? [08:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T0800) [08:00:38] I step back now, just lmk if this caused any problems! [08:00:57] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11275368 (10elukey) @TheDJ one weird thing that is happening now: I cannot reproduce anymore the long stalling. Could you please recheck? [08:01:29] (03CR) 10MVernon: [C:03+1] "LGTM :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196119 (owner: 10BCornwall) [08:01:29] awight_: no idea, I think you should run it anyway [08:01:29] :) [08:01:39] scap backport would pull it [08:01:47] kk yes I do that then [08:02:19] makes sense. [08:02:46] !log Moving CAS/IDP/SSO to Trixie. [08:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:53] (03CR) 10Slyngshede: [C:03+2] IDP: Move production to Trixie [dns] - 10https://gerrit.wikimedia.org/r/1196226 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:02:53] hashar: you were right: it ran and happily finished immediately :-) [08:03:05] !log slyngshede@dns1004 START - running authdns-update [08:03:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83912 and previous config saved to /var/cache/conftool/dbconfig/20251015-080321-root.json [08:03:28] great! 
[08:03:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83913 and previous config saved to /var/cache/conftool/dbconfig/20251015-080327-root.json [08:03:36] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [08:03:41] I am going to check the over night logs and see whether anything might block the train [08:03:49] then go ahead and promote group 1 wikis to wmf.23 [08:04:00] * hashar fasten its seat belt [08:04:24] !log slyngshede@dns1004 END - running authdns-update [08:04:39] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2053.codfw.wmnet'] [08:04:59] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2052.codfw.wmnet'] [08:05:32] trains don't have seat belts usually [08:06:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11275382 (10klausman) ml-cache1002 can be done anytime, it just needs an Icinga/Prometheus downtime. The two ml-serve machines can be done anytime during CET da... [08:07:43] taavi: except for the conductor! 
:b [08:09:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org [08:10:32] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196371 (https://phabricator.wikimedia.org/T405679) [08:10:34] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196371 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [08:11:20] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196371 (https://phabricator.wikimedia.org/T405679) (owner: 10TrainBranchBot) [08:13:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org [08:13:52] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2032 gradually with 4 steps - Pool es2032.codfw.wmnet in after cloning [08:13:53] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2032.codfw.wmnet onto es2053.codfw.wmnet [08:13:54] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2052.codfw.wmnet'] [08:14:29] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2052.codfw.wmnet'] [08:14:32] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp2052.codfw.wmnet'] [08:14:38] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2051.codfw.wmnet'] [08:18:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83915 and previous config saved to /var/cache/conftool/dbconfig/20251015-081827-root.json [08:18:32] T406488: Productionize es1049 - es1057 - 
https://phabricator.wikimedia.org/T406488 [08:18:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83916 and previous config saved to /var/cache/conftool/dbconfig/20251015-081833-root.json [08:19:37] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.23 refs T405679 [08:19:41] T405679: 1.45.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T405679 [08:22:06] hmm [08:22:20] poolcounter log rate is rising [08:22:21] I am checking [08:22:27] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2051.codfw.wmnet'] [08:22:57] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2050.codfw.wmnet'] [08:25:09] that was a one-time spike, maybe due to resource loader refreshing the caches [08:25:12] it is gone [08:25:58] the error rate has risen because of a Deprecation warning which I filed yesterday and we deemed it to not be of any importance [08:29:57] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2050.codfw.wmnet'] [08:30:48] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [08:33:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1052 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83917 and previous config saved to /var/cache/conftool/dbconfig/20251015-083333-root.json [08:33:38] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [08:33:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1057 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83918 and previous config saved to 
/var/cache/conftool/dbconfig/20251015-083339-root.json [08:34:45] elukey@cumin2002 upgrade-firmware (PID 1847334) is awaiting input [08:36:45] (03PS4) 10Filippo Giunchedi: cloudceph: handle double / single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) [08:36:45] (03PS1) 10Filippo Giunchedi: wmcs: introduce cloud_storage_subnet variables [puppet] - 10https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) [08:38:08] (03CR) 10Filippo Giunchedi: cloudceph: handle double / single NIC transition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [08:40:03] (03CR) 10CI reject: [V:04-1] cloudceph: handle double / single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [08:41:22] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [08:41:49] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:42:10] (03CR) 10Majavah: wmcs: introduce cloud_storage_subnet variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [08:44:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe [08:46:05] (03PS1) 10Hashar: Replace call to deprecated method getImages [extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) [08:47:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2206 gradually with 4 steps - Pool db2206.codfw.wmnet in after cloning [08:47:30] (03CR) 10Hashar: "On second though, the deprecation 
warning is high enough that it might end up hiding other kind of errors, hence this backport. See also t" [extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) (owner: 10Hashar) [08:47:33] (03PS13) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [08:47:34] (03PS13) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [08:47:34] (03PS12) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [08:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:ae0 (External: IX.BR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:49:33] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [08:49:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:49:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11275513 (10Clement_Goubert) I'm so sorry I haven't got around to it. Doing it now. 
[08:50:41] the mw train looks quiet so far [08:50:51] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [08:51:01] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2047.codfw.wmnet'] [08:51:17] there are a bunch of PHP deprecation warnings going on, I will see to have them muted by having a fix backported/deployed ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalUsage/+/1196375 ) [08:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:52:55] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [08:57:40] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2047.codfw.wmnet'] [08:58:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11275543 (10Clement_Goubert) Done for all wikikube-worker and wikikube-ctrl. I can make myself available when you do it, or you can ping anyone from the team, I'll brief them on... 
[08:59:50] !log mwscript-k8s -- purgeUserOptions.php --wiki=loginwiki (T406724) [08:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:54] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [09:01:04] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet'] [09:01:17] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2046.codfw.wmnet'] [09:04:20] (03CR) 10Filippo Giunchedi: wmcs: introduce cloud_storage_subnet variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [09:06:50] (03PS5) 10Filippo Giunchedi: cloudceph: handle double / single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) [09:07:45] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11275561 (10Ladsgroup) [09:07:48] 06SRE, 06Infrastructure-Foundations: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11275562 (10MoritzMuehlenhoff) Most of these have underlying technical reasons, I'll defer to the Data Platform SREs if they want to reorganise the access or not [09:07:54] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11275565 (10MoritzMuehlenhoff) [09:08:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:ae0 (External: IX.BR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:11:35] (03CR) 10Vgutierrez: haproxy: add JA4H support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [09:13:00] 06SRE, 10SRE-SLO, 10Citoid, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11275577 (10Mvolz) [09:13:09] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T407130 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196379 (https://phabricator.wikimedia.org/T395443) [09:13:11] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T407130 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196379 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:13:56] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:14:03] (03CR) 10Jgiannelos: [C:03+1] Replace call to deprecated method getImages [extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) (owner: 10Hashar) [09:14:15] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:16:36] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [09:17:21] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [09:17:43] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:18:16] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:20:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11275604 (10Ladsgroup) I confirm the key is not used in WMCS. 
[09:20:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1260 gradually with 4 steps - Pool db1260.eqiad.wmnet in after cloning [09:20:50] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Phabricator: Add logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904#11275609 (10MoritzMuehlenhoff) Simon is adding support to Bitu to have users link their username similar to one can currently link a SUL account (https://phabricato... [09:22:27] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11275618 (10Ladsgroup) The name of the LDAP account is wrong. https://ldap.toolforge.org/user/lsandergreen-wmf doesn't bring anything nor other variations. [09:23:20] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11275622 (10Ladsgroup) [09:23:33] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11275623 (10Ladsgroup) ` ladsgroup@ldap-maint1001:~$ ldapsearch -x mail=lsandergreen@wikimedia.org # extended LDIF # # LDAPv3 # base (default) with scope subtree #... [09:25:56] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11275638 (10Ladsgroup) Confirming the ssh key is not used in WMCS. 
[09:26:26] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11275639 (10Ladsgroup) [09:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:45] (03PS14) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [09:28:45] (03PS14) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [09:28:46] (03PS13) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [09:29:27] (03CR) 10Zabe: [C:03+1] "The error rate is like ~37000 / hour now and will go up once we get to group2." [extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) (owner: 10Hashar) [09:30:24] (03PS4) 10Muehlenhoff: installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) [09:30:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:30:46] (03CR) 10Hashar: "Yup, Yiannis and I are chatting about it over private messages." 
[extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) (owner: 10Hashar) [09:31:41] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2045.codfw.wmnet'] [09:31:49] (03Abandoned) 10Ladsgroup: MetaContactPages: Add affcom conflict reporting page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127958 (https://phabricator.wikimedia.org/T388919) (owner: 10Ladsgroup) [09:31:58] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2045.codfw.wmnet'] [09:32:06] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2043.codfw.wmnet'] [09:32:27] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2043.codfw.wmnet'] [09:32:42] (03CR) 10CI reject: [V:04-1] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:32:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2206 gradually with 4 steps - Pool db2206.codfw.wmnet in after cloning [09:33:02] jouncebot: nowandnext [09:33:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2206.codfw.wmnet onto db2248.codfw.wmnet [09:33:02] For the next 0 hour(s) and 26 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T0800) [09:33:02] In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1000) [09:34:05] hashar: will you use the rest of the window or may I do a quick backport? 
[09:34:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:35:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11275677 (10elukey) SSD firmwares updated on all cp hosts! So at this point we can try to reimage all hosts to trixie. For some reason cp2043 wasn't able to PX... [09:36:43] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:37:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS trixie [09:38:00] 06SRE, 10SRE-SLO, 10Citoid, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11275686 (10elukey) 05Open→03Resolved I had a chat with Marielle and we decided to close this task, and work on the success ra... 
[09:39:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:40:34] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11275721 (10elukey) I had a chat with @Mvolz today and we decided to proceed in this way:... [09:43:20] elukey@cumin1003 reimage (PID 2887271) is awaiting input [09:44:01] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS trixie [09:44:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2044.codfw.wmnet with OS trixie [09:48:55] (03PS3) 10Jcrespo: admin: Replace yubikey with one with a key handle stored on disk [puppet] - 10https://gerrit.wikimedia.org/r/1196085 [09:49:10] (03PS15) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [09:49:10] (03PS15) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [09:49:10] (03PS14) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [09:50:09] (03CR) 10Jcrespo: [C:03+2] admin: Replace yubikey with one with a key handle stored on disk [puppet] - 10https://gerrit.wikimedia.org/r/1196085 (owner: 10Jcrespo) [09:50:35] (03CR) 10Muehlenhoff: [C:03+2] installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:52:39] 
06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11275789 (10Ladsgroup) L3 has been signed on 2016. @Volker_E Would you mind re-reading and resigning the newer version of L3? Thank you! [09:57:48] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:58:34] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [09:59:42] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T395445 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196382 (https://phabricator.wikimedia.org/T395443) [09:59:44] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T395445 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196382 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:59:56] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T407329 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196386 (https://phabricator.wikimedia.org/T395443) [09:59:58] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T407329 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196386 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1000) [10:00:14] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T407330 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196387 (https://phabricator.wikimedia.org/T395443) [10:00:19] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T407330 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196387 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [10:00:35] (03PS1) 10Tiziano Fogli: monitoring 
services: add migration task T407331 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196390 (https://phabricator.wikimedia.org/T395443) [10:00:50] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T407331 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1196390 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [10:01:50] (03CR) 10Vgutierrez: haproxy: add JA4H support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [10:01:55] jouncebot: now [10:01:55] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1000) [10:02:15] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1005.eqiad.wmnet [10:02:38] is the MediaWiki infrastructure window used for anything? I'd like to push a backport [10:02:40] to cut some log spam [10:02:54] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalUsage/+/1196375 :) [10:03:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [10:06:54] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [10:08:43] (03Merged) 10jenkins-bot: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [10:09:33] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1005.eqiad.wmnet [10:09:37] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1006.eqiad.wmnet [10:09:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) (owner: 10Hashar) [10:11:22] (03Merged) 10jenkins-bot: Replace call to deprecated method getImages [extensions/GlobalUsage] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196375 (https://phabricator.wikimedia.org/T407184) (owner: 10Hashar) [10:11:54] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1196375|Replace call to deprecated method getImages (T407184)]] [10:12:00] T407184: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::getImages was deprecated in MediaWiki 1.43. [Called from MediaWiki\Extension\GlobalUsage\Hooks::onLinksUpdateComplete] - https://phabricator.wikimedia.org/T407184 [10:13:18] (03CR) 10Muehlenhoff: [C:03+2] atftpd: Drop service definition [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [10:14:04] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:14:13] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:14:47] (03PS1) 10Ladsgroup: admin: Add Maria Lechner WMDE to LDAP only list [puppet] - 10https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) [10:16:11] !log hashar@deploy2002 hashar: Backport for [[gerrit:1196375|Replace call to deprecated method getImages (T407184)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:16:14] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1006.eqiad.wmnet
[10:16:18] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1007.eqiad.wmnet
[10:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:18:33] !log deleted legacy EMEA/Americas business hours Splunk rotations
[10:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:14] (CR) Muehlenhoff: [C:+1] "Patch looks good (once confirmation about signed NDA is in)" [puppet] - https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) (owner: Ladsgroup)
[10:21:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2044.codfw.wmnet with OS trixie
[10:21:36] hnowlan: \o/
[10:23:15] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1007.eqiad.wmnet
[10:23:16] !log installing libcommons-lang3-java security updates
[10:23:19] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1008.eqiad.wmnet
[10:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:30] (PS1) Federico Ceratto: instances.yaml, es2053.yaml: Prepare es2053 for production [puppet] - https://gerrit.wikimedia.org/r/1196397 (https://phabricator.wikimedia.org/T402859)
[10:24:12] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11275939 (elukey) @Jhancock.wm Hi! So I've reimaged cp2044 with Debian Trixie and everything went fine, we can proceed to reimage the rest with Trixie and see...
[10:25:25] (PS2) Ladsgroup: admin: Add Maria Lechner WMDE to LDAP only list [puppet] - https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106)
[10:25:31] (CR) Ladsgroup: [V:+2 C:+2] admin: Add Maria Lechner WMDE to LDAP only list [puppet] - https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) (owner: Ladsgroup)
[10:30:16] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1008.eqiad.wmnet
[10:30:20] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1009.eqiad.wmnet
[10:32:57] (CR) Clément Goubert: [C:-1] "Broken lua table" [puppet] - https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz)
[10:37:04] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1009.eqiad.wmnet
[10:38:53] (PS6) Giuseppe Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092)
[10:40:05] !log hashar@deploy2002 hashar: Continuing with sync
[10:42:08] (CR) Ladsgroup: [V:+2 C:+2] "FTR for the NDA confirmation: https://phabricator.wikimedia.org/T405917#11274755" [puppet] - https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) (owner: Ladsgroup)
[10:44:13] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196375|Replace call to deprecated method getImages (T407184)]] (duration: 32m 19s)
[10:44:17] T407184: PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::getImages was deprecated in MediaWiki 1.43.
[Called from MediaWiki\Extension\GlobalUsage\Hooks::onLinksUpdateComplete] - https://phabricator.wikimedia.org/T407184
[10:49:41] I have completed the backport
[10:54:29] (CR) Brouberol: [C:+2] airflow-ml: enable the triggerer component [deployment-charts] - https://gerrit.wikimedia.org/r/1196029 (https://phabricator.wikimedia.org/T406958) (owner: Brouberol)
[10:55:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[10:55:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[11:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1100).
[11:05:26] (PS1) Cathal Mooney: Move addition of network device intermediate CA to separate file [puppet] - https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511)
[11:06:59] !log disabling puppet on cp nodes for 1195679: trafficserver: remove gateway-check group-specific routes for rest.php - T406318
[11:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:02] T406318: rest.php via rest-gateway production rollout - https://phabricator.wikimedia.org/T406318
[11:08:23] (CR) Muehlenhoff: [C:+2] Failover failoid in codfw [puppet] - https://gerrit.wikimedia.org/r/1183101 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff)
[11:08:27] (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[11:10:48] (CR) Brouberol: [C:+1] Update the definition of @dse_kubepods_networks [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: Btullis)
[11:10:56] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] -
https://gerrit.wikimedia.org/r/1195221 (owner: PipelineBot)
[11:10:59] (CR) Clément Goubert: [C:+2] trafficserver: remove gateway-check group-specific routes for rest.php [puppet] - https://gerrit.wikimedia.org/r/1195679 (https://phabricator.wikimedia.org/T406318) (owner: Hnowlan)
[11:12:49] !log Enabling puppet on cp6015 for 1195679: trafficserver: remove gateway-check group-specific routes for rest.php - T406318
[11:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:54] T406318: rest.php via rest-gateway production rollout - https://phabricator.wikimedia.org/T406318
[11:13:12] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1195221 (owner: PipelineBot)
[11:14:58] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:15:16] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:16:08] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:16:52] !log Enabling puppet on all cp nodes for 1195679: trafficserver: remove gateway-check group-specific routes for rest.php - T406318
[11:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:58] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:17:31] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:18:57] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:19:39] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:20:15] (PS1) Muehlenhoff: Assign failoid role to failoid1003 [puppet] - https://gerrit.wikimedia.org/r/1196407 (https://phabricator.wikimedia.org/T402406)
[11:21:09] ops-codfw, SRE, DC-Ops, Traffic: Create boot environment of
Bullseye with a 6.1 kernel - https://phabricator.wikimedia.org/T405102#11276137 (MoritzMuehlenhoff) >>! In T405102#11273708, @ssingh wrote: > Traffic discussed this in the team meeting today. We decided that given the above blocker,...
[11:26:58] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:28:09] (CR) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[11:28:19] (CR) Muehlenhoff: [C:+2] Assign failoid role to failoid1003 [puppet] - https://gerrit.wikimedia.org/r/1196407 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff)
[11:31:12] (PS1) Muehlenhoff: Failover failoid in eqiad to failoid1003 [puppet] - https://gerrit.wikimedia.org/r/1196409 (https://phabricator.wikimedia.org/T402406)
[11:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
[11:41:28] (CR) Cathal Mooney: [C:+1] "LGTM! Naming makes sense to me based on the existing but also fine to change if you guys decide to change it."
[puppet] - https://gerrit.wikimedia.org/r/1196372 (https://phabricator.wikimedia.org/T405478) (owner: Filippo Giunchedi)
[11:41:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
[11:47:49] Ok the trafficserver patch actually broke something but I'm not sure what, reverting
[11:48:04] (PS1) Clément Goubert: Revert "trafficserver: remove gateway-check group-specific routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196411
[11:48:30] (no outage, just we're not routing through the rest-gateway anymore for some unknown reason)
[11:49:47] (PS1) Btullis: Deploy the opensearch-operator to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1196412 (https://phabricator.wikimedia.org/T404907)
[11:50:03] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2005.codfw.wmnet
[11:51:15] (CR) Clément Goubert: [C:+2] Revert "trafficserver: remove gateway-check group-specific routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196411 (owner: Clément Goubert)
[11:55:32] (PS1) Marostegui: mariadb: Productionize db2247 [puppet] - https://gerrit.wikimedia.org/r/1196414 (https://phabricator.wikimedia.org/T406551)
[11:56:20] (CR) Marostegui: [C:+2] mariadb: Productionize db2247 [puppet] - https://gerrit.wikimedia.org/r/1196414 (https://phabricator.wikimedia.org/T406551) (owner: Marostegui)
[11:57:04] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2005.codfw.wmnet
[11:57:08] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2006.codfw.wmnet
[12:01:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2206.codfw.wmnet onto db2247.codfw.wmnet
[12:02:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2206 - Depool db2206.codfw.wmnet to then clone it to db2247.codfw.wmnet -
marostegui@cumin1003
[12:02:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2206 - Depool db2206.codfw.wmnet to then clone it to db2247.codfw.wmnet - marostegui@cumin1003
[12:03:30] (PS1) Marostegui: db1260: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196415 (https://phabricator.wikimedia.org/T406550)
[12:04:00] (CR) Marostegui: [C:+2] db1260: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1196415 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
[12:05:04] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2006.codfw.wmnet
[12:05:08] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2007.codfw.wmnet
[12:05:19] (CR) Marostegui: [C:+1] instances.yaml, es2053.yaml: Prepare es2053 for production [puppet] - https://gerrit.wikimedia.org/r/1196397 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[12:05:33] (PS25) Daniel Kinzler: api-gateway: Add rate limiting for REST gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574)
[12:11:25] (CR) Cathal Mooney: [C:+1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1196409 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff)
[12:12:00] (CR) Brouberol: [C:+1] Deploy the opensearch-operator to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1196412 (https://phabricator.wikimedia.org/T404907) (owner: Btullis)
[12:12:08] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2007.codfw.wmnet
[12:12:12] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2008.codfw.wmnet
[12:13:06] (PS1) Clément Goubert: trafficserver: Fix logic if groupmatch is empty [puppet] - https://gerrit.wikimedia.org/r/1196416 (https://phabricator.wikimedia.org/T406318)
[12:13:33] (PS1) Clément Goubert: Revert "trafficserver: remove gateway-check group-specific routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196417
[12:14:00] (Abandoned) Clément Goubert: Revert "trafficserver: remove gateway-check group-specific routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196417 (owner: Clément Goubert)
[12:14:35] (PS1) Clément Goubert: Revert^2 "trafficserver: remove gateway-check group-specific routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196418
[12:17:19] (CR) CI reject: [V:-1] Revert^2 "trafficserver: remove gateway-check group-specific routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196418 (owner: Clément Goubert)
[12:17:53] (PS2) Clément Goubert: Revert^2 "trafficserver: remove gateway-check group routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196418
[12:18:49] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2008.codfw.wmnet
[12:18:54] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2009.codfw.wmnet
[12:24:21] (CR) Muehlenhoff: [C:+2] Failover failoid in eqiad to
failoid1003 [puppet] - https://gerrit.wikimedia.org/r/1196409 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff)
[12:25:53] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2009.codfw.wmnet
[12:26:32] !log installing ghostscript security updates
[12:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:31] (CR) Fabfur: [C:+1] trafficserver: Fix logic if groupmatch is empty [puppet] - https://gerrit.wikimedia.org/r/1196416 (https://phabricator.wikimedia.org/T406318) (owner: Clément Goubert)
[12:27:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host parsoidtest1001.eqiad.wmnet
[12:28:56] (CR) Clément Goubert: [C:+2] trafficserver: Fix logic if groupmatch is empty [puppet] - https://gerrit.wikimedia.org/r/1196416 (https://phabricator.wikimedia.org/T406318) (owner: Clément Goubert)
[12:29:24] !log disabling puppet on cp nodes for T406318
[12:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:28] T406318: rest.php via rest-gateway production rollout - https://phabricator.wikimedia.org/T406318
[12:33:25] (CR) Btullis: [C:+2] Deploy the opensearch-operator to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1196412 (https://phabricator.wikimedia.org/T404907) (owner: Btullis)
[12:33:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parsoidtest1001.eqiad.wmnet
[12:35:41] fabfur: ok looks good, I'll merge the revert of the config patch, test that, and then let puppet deploy both, sounds good?
[12:35:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196418 < This is the revert^2 of the config patch
[12:40:10] ack!
[12:40:29] (CR) Clément Goubert: [C:+2] Revert^2 "trafficserver: remove gateway-check group routes for rest.php" [puppet] - https://gerrit.wikimedia.org/r/1196418 (owner: Clément Goubert)
[12:40:59] (Merged) jenkins-bot: Deploy the opensearch-operator to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1196412 (https://phabricator.wikimedia.org/T404907) (owner: Btullis)
[12:44:41] fabfur: everything looks good
[12:44:49] !log enabling puppet on cp nodes for T406318
[12:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:54] T406318: rest.php via rest-gateway production rollout - https://phabricator.wikimedia.org/T406318
[12:45:35] great!
[12:49:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1026 T407351', diff saved to https://phabricator.wikimedia.org/P83925 and previous config saved to /var/cache/conftool/dbconfig/20251015-124927-marostegui.json
[12:49:31] T407351: decommission es1026.eqiad.wmnet - https://phabricator.wikimedia.org/T407351
[12:50:17] (PS1) Marostegui: es1026: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1196426 (https://phabricator.wikimedia.org/T407351)
[12:50:21] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:51:56] (CR) Marostegui: [C:+2] es1026: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1196426 (https://phabricator.wikimedia.org/T407351) (owner: Marostegui)
[12:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:57:08] (PS1) Muehlenhoff: Add a Prometheus exporter to monitor the validity of the internal Ganeti CA [puppet] - https://gerrit.wikimedia.org/r/1196430 (https://phabricator.wikimedia.org/T382902)
[12:57:55] SRE, Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11276521 (elukey) > Split analytics-privatedata-users into 3 groups in data.yaml that have more precise names. Then folks can say precisely what flavor of it they are applyi...
[13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1300)
[13:00:05] No Gerrit patches in the queue for this window AFAICS.
[13:00:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:00:50] looks like there’s nothing to deploy
[13:01:43] (CR) TChin: [C:+2] [mw-enrichment] Bump page change schema to 1.3.0 to pick up user_central_id [deployment-charts] - https://gerrit.wikimedia.org/r/1196179 (https://phabricator.wikimedia.org/T401725) (owner: TChin)
[13:02:44] * Lucas_WMDE also in a meeting now, so if anything needs deploying, find someone else :)
[13:02:47] (PS1) Marostegui: mariadb: Productionize sretest2003 [puppet] - https://gerrit.wikimedia.org/r/1196431 (https://phabricator.wikimedia.org/T407352)
[13:03:56] (Merged) jenkins-bot: [mw-enrichment] Bump page change schema to 1.3.0 to pick up user_central_id [deployment-charts] - https://gerrit.wikimedia.org/r/1196179 (https://phabricator.wikimedia.org/T401725) (owner: TChin)
[13:05:18] (PS1) Marostegui: installserver: Reimage sretest2003 [puppet] - https://gerrit.wikimedia.org/r/1196432 (https://phabricator.wikimedia.org/T407352)
[13:07:51] (CR) Marostegui: [C:+2] installserver: Reimage sretest2003 [puppet] - https://gerrit.wikimedia.org/r/1196432 (https://phabricator.wikimedia.org/T407352) (owner: Marostegui)
[13:08:54] (CR) Marostegui: [C:+2] mariadb: Productionize sretest2003 [puppet] - https://gerrit.wikimedia.org/r/1196431 (https://phabricator.wikimedia.org/T407352) (owner: Marostegui)
[13:09:34] SRE, Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11276577 (Novem_Linguae) >>! In T405517#11276521, @elukey wrote: > Could we get some clarity about what is confusing about the current workflow? 1) The division of https://...
[13:09:36] (PS1) Brouberol: Move operator related common values away from services values and into admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1196433 (https://phabricator.wikimedia.org/T404907)
[13:10:23] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11276580 (ssingh) >>! In T405499#11275363, @cmooney wrote: >>>! In T405499#11273763, @ssingh wrote: >> FWIW we have typically reimaged for this in...
[13:10:43] (CR) Btullis: [C:+1] Move operator related common values away from services values and into admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1196433 (https://phabricator.wikimedia.org/T404907) (owner: Brouberol)
[13:12:37] (CR) Elukey: "Left a suggestion, lemme know!" [puppet] - https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[13:14:05] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:14:12] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11276592 (elukey) I was able to repro again for another tile: ` elukey@deploy1003:~$ time curl -i "https://kartotherian.svc.codfw.wmnet:6543/img/osm-intl,14,a,a,300x200.png?lang=en...
[13:14:19] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:15:25] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_magru
[13:15:35] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-text_magru
[13:16:06] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: already rebooted; pooling]
[13:16:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet
[13:16:44] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_magru and not P{cp7001*} and A:cp
[13:17:12] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_magru
[13:17:55] (CR) Brouberol: [C:+2] Move operator related common values away from services values and into admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1196433 (https://phabricator.wikimedia.org/T404907) (owner: Brouberol)
[13:18:07] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_drmrs
[13:18:16] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_drmrs
[13:19:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:19:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet
[13:20:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:23:59] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11276631 (elukey) I discovered that we do log postgres SQL requests in /var/log/postgres that take more than 10s to complete. On maps2012. the majority look like the following: ` 2...
[13:26:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1260 gradually with 4 steps - Pool db1260.eqiad.wmnet in after cloning
[13:26:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1260.eqiad.wmnet onto db1261.eqiad.wmnet
[13:27:06] (PS1) Federico Ceratto: zarcillo: update egress after IDP ipaddr changes [deployment-charts] - https://gerrit.wikimedia.org/r/1196437 (https://phabricator.wikimedia.org/T384810)
[13:27:06] (CR) Federico Ceratto: "As discussed on IRC" [deployment-charts] - https://gerrit.wikimedia.org/r/1196437 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[13:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:35] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7009.magru.wmnet
[13:29:21] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7002.magru.wmnet
[13:29:28] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6001.drmrs.wmnet
[13:29:38] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6009.drmrs.wmnet
[13:29:51] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-10-06-225918 to 2025-10-14-194525 [deployment-charts] - https://gerrit.wikimedia.org/r/1196439 (https://phabricator.wikimedia.org/T405130)
[13:29:57] (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2025-10-09-001812 to 2025-10-15-120631 [deployment-charts] - https://gerrit.wikimedia.org/r/1196440
[13:30:24] (Restored) Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[13:31:18] !log
tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:31:26] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:31:36] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#11276703 (Ottomata)
[13:33:07] (PS1) D3r1ck01: Add virtual domain mapping for OAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485)
[13:33:21] (PS5) Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565)
[13:33:25] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:33:33] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:34:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[13:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:35:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:37:14] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#11276724 (brouberol) I don't know why I haven't posted it here, but I should have posted this [Kafka upgrade plan](https://docs.google.com/...
[13:37:56] (PS47) Brouberol: Deploy an opensearch cluster to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[13:38:32] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#11276731 (elukey) We should form a working group to get this done, maybe in two quarters starting from the next one? One for testing the up...
[13:39:03] (CR) Federico Ceratto: [C:+2] instances.yaml, es2053.yaml: Prepare es2053 for production [puppet] - https://gerrit.wikimedia.org/r/1196397 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[13:39:43] (CR) Tiziano Fogli: Introduce v1 xLab / MPIC SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[13:40:00] (PS48) Brouberol: Deploy an opensearch cluster to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[13:40:26] (PS6) Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565)
[13:41:26] (CR) Tiziano Fogli: Introduce v1 xLab / MPIC SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[13:43:30] (CR) Brouberol: [C:+1] Deploy an opensearch cluster to the opensearch-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[13:54:22] (CR) Cathal Mooney: Move addition of network device intermediate CA to separate file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1196406
(https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[13:56:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Add es2053 T402859', diff saved to https://phabricator.wikimedia.org/P83929 and previous config saved to /var/cache/conftool/dbconfig/20251015-135630-fceratto.json
[13:56:35] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859
[13:58:12] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#11276818 (brouberol) Agreed
[13:59:57] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1400)
[14:00:23] (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2025-10-06-225918 to 2025-10-14-194525 [deployment-charts] - https://gerrit.wikimedia.org/r/1196439 (https://phabricator.wikimedia.org/T405130) (owner: Jforrester)
[14:02:11] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-10-06-225918 to 2025-10-14-194525 [deployment-charts] - https://gerrit.wikimedia.org/r/1196439 (https://phabricator.wikimedia.org/T405130) (owner: Jforrester)
[14:03:26] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:03:35] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:04:00] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:04:15] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:04:20] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:04:46]
!log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:04:52] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:05:09] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11276918 (MatthewVernon) So, the swift & Ceph nodes: - ms-be* please do 1 at a time (and check pingable again before moving onto the next) - ms-fe*...
[14:05:22] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:05:23] ops-eqiad, SRE, cloud-services-team, DC-Ops: Do something with cloudcontrol100[8-10]-dev - https://phabricator.wikimedia.org/T406630#11276920 (taavi) p:Triage→Medium
[14:05:31] ops-eqiad, SRE, cloud-services-team, DC-Ops: Do something with cloudcontrol100[8-10]-dev - https://phabricator.wikimedia.org/T406630#11276923 (taavi)
[14:05:38] (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2025-10-09-001812 to 2025-10-15-120631 [deployment-charts] - https://gerrit.wikimedia.org/r/1196440 (owner: Jforrester)
[14:07:58] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host mwdebug1001.eqiad.wmnet
[14:08:13] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-10-09-001812 to 2025-10-15-120631 [deployment-charts] - https://gerrit.wikimedia.org/r/1196440 (owner: Jforrester)
[14:09:23] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:09:28] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy1003.eqiad.wmnet
[14:09:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:10:10] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:10:49] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting
cp6002.drmrs.wmnet [14:11:00] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6010.drmrs.wmnet [14:11:25] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2053.codfw.wmnet [14:11:26] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2053.codfw.wmnet [14:11:28] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:11:34] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:41] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2053.codfw.wmnet [14:11:42] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2053.codfw.wmnet [14:11:50] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7010.magru.wmnet [14:11:54] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1001.eqiad.wmnet [14:12:03] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7003.magru.wmnet [14:12:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196437 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:12:24] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:15] (03PS7) 10Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) [14:13:37] (03PS2) 10Cathal Mooney: Move addition of network device intermediate CA to separate file [puppet] - 10https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) [14:13:58] (03PS1) 10NMW03: Add wgSitename for azwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196460 (https://phabricator.wikimedia.org/T407358) [14:14:14] (03CR) 10Cathal Mooney: Move addition of network device 
intermediate CA to separate file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [14:14:27] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host mwdebug1002.eqiad.wmnet [14:14:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2053 slowly with 10 steps - Pooling in new host [14:14:47] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host mwdebug2001.codfw.wmnet [14:14:56] (03PS3) 10Cathal Mooney: Move addition of network device intermediate CA to separate file [puppet] - 10https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) [14:15:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1002.eqiad.wmnet [14:15:41] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: update egress after IDP ipaddr changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196437 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:16:05] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] zarcillo: update egress after IDP ipaddr changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196437 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:16:07] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host mwdebug2002.codfw.wmnet [14:16:17] (03PS1) 10Muehlenhoff: jaeger: Add new IDP IP addressess [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196461 (https://phabricator.wikimedia.org/T406455) [14:17:17] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:17:35] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration 
changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:17:53] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:17:53] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [14:17:56] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:18:05] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:18:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11277045 (10colewhite) [14:19:04] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Logging: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11277049 (10colewhite) [14:19:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1003.eqiad.wmnet [14:19:28] FIRING: KeyholderUnarmed: 18 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:19:56] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:19:58] (03CR) 10Bking: [C:03+2] Deploy an opensearch cluster to the opensearch-test namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:20:45] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug2001.codfw.wmnet [14:21:13] (03PS11) 10CDanis: haproxy: add JA4H support [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) [14:21:13] (03PS2) 10CDanis: haproxy: enable ja4h on cp7008 [puppet] - 10https://gerrit.wikimedia.org/r/1195234 (https://phabricator.wikimedia.org/T406990) [14:22:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [14:22:06] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug2002.codfw.wmnet [14:22:48] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) es2053 slowly with 10 steps - Pooling in new host [14:23:22] (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184932 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:23:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'es2053 set ipaddr before pool-in', diff saved to https://phabricator.wikimedia.org/P83930 and previous config saved to /var/cache/conftool/dbconfig/20251015-142339-fceratto.json [14:23:48] (03CR) 10CDanis: haproxy: add JA4H support (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [14:24:28] RESOLVED: KeyholderUnarmed: 18 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:24:37] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2053 slowly with 10 steps - Pooling in new host [14:24:42] jouncebot: nowandnext [14:24:42] For the next 0 hour(s) and 35 minute(s): Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1400) [14:24:42] In 0 hour(s) and 5 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1430) [14:25:14] (03PS1) 10Ssingh: admin: add SSH fido key for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/1196463 [14:25:33] (03PS2) 10Ssingh: admin: add SSH FIDO key for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/1196463 [14:25:51] (03PS1) 10Majavah: hieradata: Fix duplicate role_contacts declaration for an-redacteddb [puppet] - 10https://gerrit.wikimedia.org/r/1196464 [14:26:27] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host parsoidtest1001.eqiad.wmnet [14:27:17] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:27:37] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:27:51] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7281/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:27:53] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:29:04] jhathaway: fabfur: heads up I'm rebooting deploy2002 (main deployment server) [14:29:23] claime: thanks [14:29:30] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1400) [14:30:05] Deploy window xLab Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1430) [14:30:21] (03CR) 10Slyngshede: [C:03+1] "The not_valid_after.timestamp() is deprecated, but the replacement isn't introduced until version 42.0.0 of the cryptography library, and " [puppet] - 10https://gerrit.wikimedia.org/r/1196430 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [14:30:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1002.eqiad.wmnet [14:31:45] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parsoidtest1001.eqiad.wmnet [14:33:27] (03CR) 10Elukey: [C:03+1] Move addition of network device intermediate CA to separate file [puppet] - 10https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [14:33:58] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:34:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1002.eqiad.wmnet [14:34:28] FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:34:37] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-eqiad [14:34:44] (03PS1) 10Peter Fischer: SUP: upgrade Java 17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196470 (https://phabricator.wikimedia.org/T404417) [14:35:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key was also validated out-of-band" [puppet] - 10https://gerrit.wikimedia.org/r/1196463 (owner: 10Ssingh) [14:35:32] (03CR) 10Ssingh: [C:03+2] admin: add SSH FIDO key for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/1196463 (owner: 10Ssingh) [14:35:47] (03PS1) 10Scott French: Enroll 5% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196466 
(https://phabricator.wikimedia.org/T405955) [14:35:48] (03PS1) 10Scott French: mw-(api-int|jobrunner): Serve ~ 1% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196467 (https://phabricator.wikimedia.org/T405955) [14:35:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2054.codfw.wmnet with reason: Setting up new ES host [14:36:42] (03CR) 10Clément Goubert: [C:03+1] mw-(api-int|jobrunner): Serve ~ 1% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196467 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:36:48] (03PS36) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [14:37:11] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:12] (03CR) 10Clément Goubert: [C:03+1] Enroll 5% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196466 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:37:28] (03PS37) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [14:37:51] !log armed keyholder on cumin1002 following reboot [14:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:49] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:38:51] (03CR) 10FNegri: hieradata: 
Fix duplicate role_contacts declaration for an-redacteddb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:39:05] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:39:23] FIRING: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:28] RESOLVED: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:40:26] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [14:40:58] FIRING: [2x] KeyholderUnarmed: 18 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:41:11] !log armed keyholder on deploy[1003|2002] following reboots [14:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:19] (03CR) 10Bunnypranav: [C:03+1] Add wgSitename for azwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196460 (https://phabricator.wikimedia.org/T407358) (owner: 10NMW03) [14:41:53] FIRING: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:11] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:02] 
(03CR) 10Ladsgroup: [C:04-1] Add virtual domain mapping for OAuth (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [14:43:30] (03CR) 10Elukey: [C:03+2] role::maps: increase max-conns and shared buffers on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [14:43:40] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2033.codfw.wmnet onto es2054.codfw.wmnet [14:43:45] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2033 - Depool es2033.codfw.wmnet to then clone it to es2054.codfw.wmnet - fceratto@cumin1003 [14:44:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2033 - Depool es2033.codfw.wmnet to then clone it to es2054.codfw.wmnet - fceratto@cumin1003 [14:44:23] RESOLVED: [4x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:28] (03CR) 10Btullis: "Thanks for spotting this. Yes, it can be made more like a private Data Engineering server, so we can remove the WMCS references." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:45:38] (03PS4) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [14:45:58] RESOLVED: KeyholderUnarmed: 18 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:46:05] (03CR) 10CI reject: [V:04-1] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:46:29] (03PS5) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [14:46:56] (03CR) 10CI reject: [V:04-1] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:47:07] (03PS6) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [14:47:33] (03CR) 10CI reject: [V:04-1] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:47:41] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:48:15] (03PS7) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [14:48:38] (03PS2) 10Majavah: hieradata: Fix duplicate role_contacts declaration for an-redacteddb [puppet] - 10https://gerrit.wikimedia.org/r/1196464 [14:48:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:48:44] (03CR) 10CI reject: [V:04-1] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 
10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:48:53] (03CR) 10Cathal Mooney: [C:03+2] Move addition of network device intermediate CA to separate file [puppet] - 10https://gerrit.wikimedia.org/r/1196406 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [14:48:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:49:23] FIRING: [7x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:24] (03PS8) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [14:49:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7282/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:50:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [14:50:35] (03CR) 10Daniel Kinzler: api-gateway: Add rate limiting for REST gateway (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [14:51:14] (03PS26) 10Daniel Kinzler: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [14:51:14] (03PS2) 10Daniel Kinzler: api-gateway: support custom rate limit groups for rest gateway [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1192879 [14:51:27] (03CR) 10Majavah: [V:03+1] hieradata: Fix duplicate role_contacts declaration for an-redacteddb (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:51:51] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6003.drmrs.wmnet [14:51:52] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6011.drmrs.wmnet [14:52:27] (03CR) 10FNegri: [C:03+1] hieradata: Fix duplicate role_contacts declaration for an-redacteddb [puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:53:15] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Fix duplicate role_contacts declaration for an-redacteddb [puppet] - 10https://gerrit.wikimedia.org/r/1196464 (owner: 10Majavah) [14:53:25] (03PS2) 10D3r1ck01: Add virtual domain mapping for OAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) [14:53:44] (03CR) 10D3r1ck01: Add virtual domain mapping for OAuth (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [14:54:36] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7011.magru.wmnet [14:54:39] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7004.magru.wmnet [14:54:51] (03CR) 10D3r1ck01: Add virtual domain mapping for OAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [14:55:28] The k8s LIST latency is because of the mw-script cleanup triggered by the reboot [14:56:37] (03PS38) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [14:59:18] (03PS39) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [15:03:03] (03PS1) 10Brouberol: data: add yubikey-generated ssh key to the brouberol user [puppet] - 10https://gerrit.wikimedia.org/r/1196473 [15:03:26] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Nokia: Support Python config generation and JSON-RPC transport in Homer - https://phabricator.wikimedia.org/T402511#11277312 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one, updated Homer is now live on our cumin hosts and they... [15:03:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2206 gradually with 4 steps - Pool db2206.codfw.wmnet in after cloning [15:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:23] (03PS1) 10Elukey: role::maps::{master,replica}_bookworm: fix max_workers value [puppet] - 10https://gerrit.wikimedia.org/r/1196474 [15:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:04] (03CR) 10Ladsgroup: [C:03+1] Add virtual domain mapping for OAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [15:18:29] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:10] !log mforns@deploy2002 Started deploy [analytics/refinery@94efa6e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@94efa6e8] [15:21:27] !log mforns@deploy2002 Finished deploy 
[analytics/refinery@94efa6e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@94efa6e8] (duration: 02m 17s) [15:21:48] !log mforns@deploy2002 Started deploy [analytics/refinery@94efa6e]: Regular analytics weekly train [analytics/refinery@94efa6e8] [15:28:25] !log mforns@deploy2002 Finished deploy [analytics/refinery@94efa6e]: Regular analytics weekly train [analytics/refinery@94efa6e8] (duration: 06m 37s) [15:28:30] !log mforns@deploy2002 Started deploy [analytics/refinery@94efa6e] (thin): Regular analytics weekly train THIN [analytics/refinery@94efa6e8] [15:29:36] !log mforns@deploy2002 Finished deploy [analytics/refinery@94efa6e] (thin): Regular analytics weekly train THIN [analytics/refinery@94efa6e8] (duration: 01m 06s) [15:29:48] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11277470 (10Ladsgroup) [15:31:12] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6004.drmrs.wmnet [15:32:11] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:14] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6012.drmrs.wmnet [15:33:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:40] (03CR) 10Elukey: [C:03+2] role::maps::{master,replica}_bookworm: fix max_workers value [puppet] - 10https://gerrit.wikimedia.org/r/1196474 (owner: 10Elukey) [15:34:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:29] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7005.magru.wmnet [15:37:38] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7012.magru.wmnet [15:37:49] (03CR) 10Btullis: opensearch-cluster: Add secrets and network policy templates to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:40:57] (03PS2) 10Brouberol: data: add yubikey-generated ssh key to the brouberol user [puppet] - 10https://gerrit.wikimedia.org/r/1196473 [15:43:15] (03CR) 10Btullis: opensearch-cluster: Add secrets and network policy templates to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:44:19] (03CR) 10Btullis: opensearch-cluster: Add secrets and network policy templates to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:47:57] (03CR) 10Scott French: [C:03+1] "Thanks, Chris!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [15:48:17] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: Make cloudweb Icinga checks non-critical [puppet] - 10https://gerrit.wikimedia.org/r/1196019 (https://phabricator.wikimedia.org/T407208) (owner: 10Majavah) [15:49:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2206 gradually with 4 steps - Pool db2206.codfw.wmnet in after cloning [15:49:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2206.codfw.wmnet onto db2247.codfw.wmnet [15:57:44] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [16:01:11] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11277674 (10elukey) To keep archives happy: we have now eqiad and codfw pooled (old and new stack), with codfw running on what we hope it will be a more performant setup. Tomorrow I'l... 
[16:02:00] (03CR) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:03:28] (03PS1) 10Cwhite: WikimediaEvents: enable client-side error logging for plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196484 (https://phabricator.wikimedia.org/T340187) [16:08:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:12:01] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6005.drmrs.wmnet [16:12:07] (03PS40) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [16:12:35] (03CR) 10Matthias Mullie: Add reader exp to common settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [16:13:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:14:25] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6013.drmrs.wmnet [16:15:03] PROBLEM - Host ms-be2084 is 
DOWN: PING CRITICAL - Packet loss = 100% [16:15:53] (03CR) 10Btullis: opensearch-cluster: Add secrets and network policy templates to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:16:35] !log eevans@cumin1003 END (FAIL) - Cookbook sre.cassandra.roll-reboot (exit_code=1) rolling reboot on A:aqs-eqiad [16:16:51] (03PS41) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [16:17:03] (03CR) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:18:16] (03CR) 10Bking: [C:03+1] zookeeper: remove check_prometheus, disable nrpe [puppet] - 10https://gerrit.wikimedia.org/r/1192855 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [16:19:04] (03Abandoned) 10Bking: elastic: remove decommissioned hosts in beta [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [16:19:57] (03CR) 10Dzahn: "she will have to be moved to the "shell user" section but without SSH keys. to be added to analytics-privatedata-users. 
see https://phab" [puppet] - 10https://gerrit.wikimedia.org/r/1196396 (https://phabricator.wikimedia.org/T406106) (owner: 10Ladsgroup) [16:19:59] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7006.magru.wmnet [16:20:05] RECOVERY - Host ms-be2084 is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms [16:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:20:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11277856 (10Jhancock.wm) @MatthewVernon reseated the card and it's up now. My bad. confirmed the server sees the drives in the BMC but please lmk if anyt... 
[16:20:49] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7013.magru.wmnet [16:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:25:59] (03PS1) 10Clare Ming: Fix action_context for simple bot detection instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196490 (https://phabricator.wikimedia.org/T406359) [16:28:59] (03CR) 10BCornwall: [V:03+2 C:03+2] Add wmf-debci trixie image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1196119 (owner: 10BCornwall) [16:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.937s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:30:21] (03CR) 10Btullis: [C:03+1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:32:51] (03CR) 10Bking: [C:03+2] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:35:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.275s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:37:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:37:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:37:32] (03CR) 10Milimetric: [C:03+1] "thank you!" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196490 (https://phabricator.wikimedia.org/T406359) (owner: 10Clare Ming) [16:37:41] !log eevans@cumin1003 START - Cookbook sre.hosts.dhcp for host aqs1012.eqiad.wmnet [16:37:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:39:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:40:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11278009 (10cmooney) @BCornwall @Jclark-ctr provided thinks go ok in the intervening period... [16:40:43] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2053 slowly with 10 steps - Pooling in new host [16:40:45] eevans@cumin1003 dhcp (PID 3271521) is awaiting input [16:41:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11278019 (10cmooney) @BCornwall @Jclark-ctr provided thinks go ok in the intervening period... 
[16:41:51] (03PS1) 10BPirkle: Enable REST Sandbox on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) [16:41:53] (03PS1) 10FNegri: hiera: gitlab::runner::docker set MTU to 1450 [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) [16:42:04] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [16:43:25] (03PS2) 10Kimberly Sarabia: Add reader exp to common settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) [16:44:44] (03CR) 10Kimberly Sarabia: Add reader exp to common settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [16:45:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [16:46:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:47:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:48:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11278047 (10Jclark-ctr) I am good for that day just let me know time in advance [16:49:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11278052 (10Jclark-ctr) Good for this day just let me know time [16:49:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:49:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:52:07] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:52:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:52:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:52:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:53:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:53:13] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:53:22] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6006.drmrs.wmnet [16:55:44] (03PS1) 10Btullis: Change the name of the cluster deployed in the opensearch-test 
namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196497 (https://phabricator.wikimedia.org/T404907) [16:55:47] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6014.drmrs.wmnet [16:56:25] (03CR) 10Bking: [C:03+2] Change the name of the cluster deployed in the opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196497 (https://phabricator.wikimedia.org/T404907) (owner: 10Btullis) [16:58:24] (03Merged) 10jenkins-bot: Change the name of the cluster deployed in the opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196497 (https://phabricator.wikimedia.org/T404907) (owner: 10Btullis) [16:58:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:58:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:58:42] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7283/console" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [17:00:05] swfrench-wmf: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1700). 
[17:00:21] o/ [17:03:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122638 (https://phabricator.wikimedia.org/T366095) (owner: 10DLynch) [17:04:11] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7014.magru.wmnet [17:04:29] (03CR) 10Marostegui: [C:03+1] prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 (owner: 10Andrew Bogott) [17:04:44] I'll be getting started shortly, but with a slightly different plan than I originally had for the window ... [17:05:18] (03CR) 10FNegri: hiera: gitlab::runner::docker set MTU to 1450 [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [17:06:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:06:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:06:49] (03CR) 10FNegri: "I tried running PCC, but it does not seem to be configured for the gitlab-runners project. I think we can skip it for such a small change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [17:11:13] PROBLEM - Host cp7007 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:56] (03PS2) 10Scott French: deployment_server: Revert production mediawiki releases to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1196499 (https://phabricator.wikimedia.org/T405955) [17:12:48] (03CR) 10Scott French: [C:03+2] deployment_server: Revert production mediawiki releases to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1196499 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:14:01] (03PS1) 10Bking: opensearch-test: secrets-related corrections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196500 (https://phabricator.wikimedia.org/T406876) [17:15:01] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11278220 (10Jdlrobson-WMF) p:05Triage→03High [17:15:39] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11278221 (10Jdlrobson-WMF) [17:18:07] (03CR) 10Dzahn: [C:03+1] "the compiler issue seems to be a larger problem with facts not being synced to the compiler instances. 
I don't think there is configuratio" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [17:21:05] !log swfrench@deploy2002 Started scap sync-world: Revert to PHP 8.1 - T405955 [17:21:09] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:23:14] (03CR) 10Btullis: [C:03+1] opensearch-test: secrets-related corrections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196500 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [17:23:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key has also been validated out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1196473 (owner: 10Brouberol) [17:23:52] !log swfrench@deploy2002 Finished scap sync-world: Revert to PHP 8.1 - T405955 (duration: 02m 47s) [17:24:01] (03CR) 10Btullis: [C:03+2] opensearch-test: secrets-related corrections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196500 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [17:25:40] (03Merged) 10jenkins-bot: opensearch-test: secrets-related corrections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196500 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [17:26:58] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host aqs1012.eqiad.wmnet [17:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:24] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11278338 (10Jdlrobson-WMF) [17:34:43] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6007.drmrs.wmnet [17:35:06] FIRING: 
SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:37:04] (03PS1) 10Btullis: Vendor the base.certificate module and generate a certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196502 (https://phabricator.wikimedia.org/T406876) [17:37:11] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6015.drmrs.wmnet [17:37:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11278381 (10Eevans) >>! In T405942#11273802, @RobH wrote: > > [ ... ] > >>>! In T405942#11268506, @Eevans wrote: >> Provided that the moves happen one at a... 
[17:38:52] (03CR) 10Bking: [C:03+2] Vendor the base.certificate module and generate a certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196502 (https://phabricator.wikimedia.org/T406876) (owner: 10Btullis) [17:41:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:41:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:41:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:41:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:41:54] (03PS1) 10Scott French: Disable enrollment in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196503 (https://phabricator.wikimedia.org/T405955) [17:47:21] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7015.magru.wmnet [17:51:10] 10ops-eqiad, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414 (10Eevans) 03NEW [17:51:41] 10ops-eqiad, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11278487 (10Eevans) p:05Triage→03High [17:51:53] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:25] (03PS1) 10Btullis: Use our PKI generated certificate for the opensearch http interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) [17:53:59] jouncebot: nowandnext [17:53:59] For the next 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T1700) [17:53:59] In 2 hour(s) and 6 minute(s): UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T2000) [17:54:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196503 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:55:50] (03PS2) 10Btullis: Use our PKI generated certificate for the opensearch http interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) [18:03:59] (03Merged) 10jenkins-bot: Disable enrollment in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196503 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:04:34] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1196503|Disable enrollment in PHP 8.3 (T405955)]] [18:04:39] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:07:07] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1196503|Disable enrollment in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:08:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196490 (https://phabricator.wikimedia.org/T406359) (owner: 10Clare Ming) [18:09:14] 10ops-eqiad, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11278586 (10VRiley-WMF) looking into this [18:10:23] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-reboot (exit_code=1) rolling reboot on A:cp-text_magru and not P{cp7001*} and A:cp [18:10:45] !log swfrench@deploy2002 swfrench: Continuing with sync [18:12:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11278592 (10RobH) [18:14:03] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6008.drmrs.wmnet [18:14:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_drmrs [18:14:55] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196503|Disable enrollment in PHP 8.3 (T405955)]] (duration: 10m 21s) [18:15:00] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:18:25] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp6016.drmrs.wmnet [18:18:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_drmrs [18:19:44] federico3: want me to deploy the gitlab-runner change and take a look at puppet on one of them? 
[18:21:54] (03CR) 10Dzahn: [C:03+2] gerrit: typo fix in post_sync_validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1196051 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:23:16] (03CR) 10Dzahn: [C:03+1] gerrit: ask the operator to merge puppet earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1196227 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:24:08] (03CR) 10Dzahn: [C:03+1] mediawiki/httpbb: Add 25.wikipedia.org redirect [puppet] - 10https://gerrit.wikimedia.org/r/1196141 (https://phabricator.wikimedia.org/T407156) (owner: 10BCornwall) [18:24:26] (03CR) 10Dzahn: [C:03+1] phabricator: drop cluster_search config [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [18:26:06] (03CR) 10Dzahn: [C:03+1] wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [18:27:19] (03CR) 10Dzahn: [C:03+2] gerrit: re-enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:28:22] (03CR) 10Dzahn: [C:03+2] "puppet is currently disabled on gerrit2003 for debugging. so this will be applied once it gets enabled again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:28:56] (03Merged) 10jenkins-bot: gerrit: typo fix in post_sync_validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1196051 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:29:49] (03CR) 10Matthias Mullie: [C:03+1] Add reader exp to common settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [18:30:23] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7016.magru.wmnet [18:30:23] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_magru [18:33:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11278642 (10Dzahn) a:03Ahoelzl [18:35:00] (03CR) 10Dzahn: [C:03+1] P:cache::haproxy: exempt releases.wikimedia.org from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [18:40:28] (03PS3) 10Bking: Use our PKI generated certificate for the opensearch http interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) (owner: 10Btullis) [18:45:18] !log eevans@cumin1003 START - Cookbook sre.hosts.dhcp for host aqs1012.eqiad.wmnet [18:48:22] eevans@cumin1003 dhcp (PID 3284321) is awaiting input [18:49:21] RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [18:51:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:52:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:53:14] (03PS4) 10Bking: Use our PKI generated certificate for the opensearch http interface [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) (owner: 10Btullis) [18:54:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:54:13] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:54:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:55:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:56:34] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:56:46] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:57:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:57:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:58:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:00:53] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:14] !log sudo ipmitool -I lanplus -H "cp7007.mgmt.magru.wmnet" -U root -E chassis power cycle [19:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:13] RECOVERY - Host cp7007 is UP: PING OK - Packet loss = 0%, RTA = 110.55 ms [19:08:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:08:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:09:19] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:09:21] RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:09:45] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7007 is CRITICAL: SSL 
CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:09:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:09:55] depooled, downtiming [19:10:11] PROBLEM - haproxy process on cp7007 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [19:11:11] RECOVERY - haproxy process on cp7007 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [19:11:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7007 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-12-19 12:29:31 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:11:45] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp7007 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2025-11-14 05:58:19 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/HTTPS [19:12:10] 10ops-magru, 06DC-Ops, 06Traffic: cp7007 hardware issues after reboot - https://phabricator.wikimedia.org/T407421 (10ssingh) 03NEW [19:12:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:12:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:13:22] 10ops-magru, 06DC-Ops, 06Traffic: cp7007 hardware issues after reboot - https://phabricator.wikimedia.org/T407421#11278777 (10ssingh) a:03BCornwall [19:18:53] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:19:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:19:30] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on cp7007.magru.wmnet with reason: hardware issues, depooled [19:20:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:20:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:21:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:21:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:21:34] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:21:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11278804 (10Jhancock.wm) @elukey looks like the drives will need to be wiped before you can convert to uefi. if they are setup in BIOS originally, they w... [19:24:04] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11278807 (10TheDJ) i now see loadtimes that are around 3 seconds, so at least it seems better experience wise. Still not as fast as it has been I think, but possibly acceptable. [19:24:14] (03CR) 10DCausse: "I believe that values-staging.yaml might contain a hardcoded test image with java17, might make sense to remove it now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1196470 (https://phabricator.wikimedia.org/T404417) (owner: 10Peter Fischer) [19:27:34] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS bullseye [19:27:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11278811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host cp2045.codfw.wmnet with OS bullseye [19:28:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:28:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:28:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:28:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:29:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:29:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:29:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:29:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:29:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:30:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:30:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/opensearch-test: apply [19:30:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:30:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:31:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11278830 (10Jhancock.wm) [19:32:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:34:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:38:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:41:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:41:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:41:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:41:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:41:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:42:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: sync [19:42:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: sync 
[19:43:52] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11278865 (10TheDJ) One more thing that I think we should consider in this entire story.. navboxes. As go... [19:44:09] jhancock@cumin1002 reimage (PID 35263) is awaiting input [19:45:12] (03PS1) 10Robertsky: throttle rule for National Library Board Singapore workshop on 20251018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) [19:46:01] (03CR) 10CI reject: [V:04-1] throttle rule for National Library Board Singapore workshop on 20251018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) (owner: 10Robertsky) [19:51:27] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2045.codfw.wmnet with OS bullseye [19:51:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11278889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host cp2045.codfw.wmnet with OS bullseye executed with er... 
[19:52:09] (03PS2) 10Robertsky: throttle rule for National Library Board Singapore workshop on 18oct2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) [19:54:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) (owner: 10Robertsky) [19:55:02] (03CR) 10Chlod Alejandro: [C:03+1] throttle rule for National Library Board Singapore workshop on 18oct2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) (owner: 10Robertsky) [19:58:32] (03PS2) 10Zoranzoki21: Enable protection indicator for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196520 (https://phabricator.wikimedia.org/T407183) [19:58:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196520 (https://phabricator.wikimedia.org/T407183) (owner: 10Zoranzoki21) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T2000). [20:00:05] kimberly_sarabia, kemayo, cjming, robertsky, and kizule: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:25] o/ [20:00:36] o/ [20:00:41] I just have a config patch that could be bundled with others. [20:01:20] kimberly_sarabia: you around? i can deploy your and David's patch together [20:01:25] Hi! 
[20:01:35] o/ [20:02:01] Kemayo, do you want to go ahead (assuming you can self-deploy)? [20:02:20] i can do the rest of the patches in the window for those who need a deployer [20:02:24] i will need help to deploy. [20:02:42] robertsky: i gotchu [20:03:19] ..and wipe the memcached key as well cuz the event is <72 hours. >.< [20:03:21] cjming: Sure, I can go ahead. [20:03:31] great - just lmk when you're done [20:04:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122638 (https://phabricator.wikimedia.org/T366095) (owner: 10DLynch) [20:04:27] robertsky: not sure i've ever done that before - do you know what the cmd is? [20:04:50] (03Merged) 10jenkins-bot: DiscussionTools: enable thanking comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122638 (https://phabricator.wikimedia.org/T366095) (owner: 10DLynch) [20:05:15] mwscript-k8s --comment='T407422' --follow -- resetAuthenticationThrottle.php --wiki=aawiki --signup --ip=118.189.131.194 [20:05:16] T407422: Request IP whitelisting for editing workshop with National Library of SIngapore on 18 October 2025 - https://phabricator.wikimedia.org/T407422 [20:05:20] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1122638|DiscussionTools: enable thanking comments (T366095)]] [20:05:24] T366095: Deploy comment thanking to all wikis - https://phabricator.wikimedia.org/T366095 [20:07:42] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1122638|DiscussionTools: enable thanking comments (T366095)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:07:48] cjming: https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [20:08:06] !log kemayo@deploy2002 kemayo: Continuing with sync [20:08:50] deploy, then "sync-file wmf-config/throttle.php", then "mwscript-k8s --comment='T407422' --follow -- resetAuthenticationThrottle.php --wiki=aawiki --signup --ip=118.189.131.194" [20:10:21] RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [20:12:24] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122638|DiscussionTools: enable thanking comments (T366095)]] (duration: 07m 04s) [20:12:28] T366095: Deploy comment thanking to all wikis - https://phabricator.wikimedia.org/T366095 [20:12:36] cjming: Okay, you're free to do whatever now. [20:12:52] thanks! [20:13:20] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11278940 (10Izno) That was done some time ago, and isn't particularly related to this task. {T198949} Th... [20:13:46] i'm going go ahead with my patch next [20:14:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196490 (https://phabricator.wikimedia.org/T406359) (owner: 10Clare Ming) [20:15:45] robertsky: thanks - so i just need to run that script on a deployment server? 
[20:16:35] (03Merged) 10jenkins-bot: Fix action_context for simple bot detection instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196490 (https://phabricator.wikimedia.org/T406359) (owner: 10Clare Ming) [20:17:04] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1196490|Fix action_context for simple bot detection instrument (T406359)]] [20:17:08] T406359: Work on client-side Bot Detection - https://phabricator.wikimedia.org/T406359 [20:17:44] cjming: yes. [20:19:15] do i need to sync the throttle file manually or presumably does scap backport take care of that? [20:19:20] !log cjming@deploy2002 cjming: Backport for [[gerrit:1196490|Fix action_context for simple bot detection instrument (T406359)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:20:06] !log cjming@deploy2002 cjming: Continuing with sync [20:20:30] manually, as far as I can tell from the instructions on https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [20:21:33] sorry i'm late [20:23:51] It's alright, we still have time :D [20:24:16] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196490|Fix action_context for simple bot detection instrument (T406359)]] (duration: 07m 12s) [20:24:20] T406359: Work on client-side Bot Detection - https://phabricator.wikimedia.org/T406359 [20:24:50] kimberly_sarabia: i'll do yours next [20:25:37] (03PS3) 10Kimberly Sarabia: Add reader exp to common settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) [20:25:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [20:25:56] cjming: thank you! 
[20:26:40] (03Merged) 10jenkins-bot: Add reader exp to common settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196192 (https://phabricator.wikimedia.org/T406916) (owner: 10Kimberly Sarabia) [20:27:10] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1196192|Add reader exp to common settings (T406916)]] [20:27:14] T406916: Reader Experiments: Deploy extension to production Arabic, Vietnamese, French, Chinese, Indonesian Wikipedia - https://phabricator.wikimedia.org/T406916 [20:28:48] for any SREs in the house: where do i run `sync-file wmf-config/throttle.php` from? [20:28:53] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:29:34] !log cjming@deploy2002 ksarabia, cjming: Backport for [[gerrit:1196192|Add reader exp to common settings (T406916)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:29:54] kimberly_sarabia: lmk when to sync [20:30:21] RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:31:28] cjming: one moment. i need to check with team for a sanity check. it looks funky right now [20:32:42] np - standing by [20:32:53] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:33:15] cjming: seems that from T394639, scap backport should suffice, then memcached clear [20:33:15] T394639: Temporary IP lift request for Leeds University Wednesday 21 May 1130-1630 UTC - https://phabricator.wikimedia.org/T394639 [20:33:17] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host aqs1012.eqiad.wmnet [20:34:21] RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:34:47] robertsky: sounds good [20:36:27] cjming: I didn't hear from anyone. Since this is gated, I'm ok moving forward if you are [20:36:43] up to you - so i'm clear to sync? 
[20:36:50] cjming: yes [20:36:54] !log cjming@deploy2002 ksarabia, cjming: Continuing with sync [20:37:05] 🤞 [20:37:46] robertsky: i'll do your patch next after Kim's finishes [20:37:51] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:11] ok [20:39:39] kizule: do you need a deployer? [20:40:48] cjming: Yup, I don't have access for deployments. [20:41:01] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196192|Add reader exp to common settings (T406916)]] (duration: 13m 51s) [20:41:05] T406916: Reader Experiments: Deploy extension to production Arabic, Vietnamese, French, Chinese, Indonesian Wikipedia - https://phabricator.wikimedia.org/T406916 [20:41:18] Kizule: np - i can take care of it [20:41:59] cjming: Ty [20:42:23] kimberly_sarabia: should be live! [20:42:39] (03PS3) 10Robertsky: throttle rule for National Library Board Singapore workshop on 18oct2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) [20:43:20] cjming: thanks! [20:43:28] yw! [20:43:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) (owner: 10Robertsky) [20:44:22] (03Merged) 10jenkins-bot: throttle rule for National Library Board Singapore workshop on 18oct2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196519 (https://phabricator.wikimedia.org/T407422) (owner: 10Robertsky) [20:44:52] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1196519|throttle rule for National Library Board Singapore workshop on 18oct2025 (T407422)]] [20:44:55] T407422: Request IP whitelisting for editing workshop with National Library of SIngapore on 18 October 2025 - https://phabricator.wikimedia.org/T407422 [20:45:41] robertsky: assuming there's nothing to check so go ahead and sync when it's ready? [20:45:50] yup [20:46:46] please do. 
only time to check is when Saturday comes. [20:47:12] !log cjming@deploy2002 cjming, robertsky: Backport for [[gerrit:1196519|throttle rule for National Library Board Singapore workshop on 18oct2025 (T407422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:47:34] !log cjming@deploy2002 cjming, robertsky: Continuing with sync [20:51:39] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196519|throttle rule for National Library Board Singapore workshop on 18oct2025 (T407422)]] (duration: 06m 48s) [20:51:44] T407422: Request IP whitelisting for editing workshop with National Library of SIngapore on 18 October 2025 - https://phabricator.wikimedia.org/T407422 [20:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:52:43] robertsky: should be live! and i just ran the script [20:53:06] ok. so the memcached is cleared? thanks! [20:53:07] (03PS3) 10Zoranzoki21: Enable protection indicator for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196520 (https://phabricator.wikimedia.org/T407183) [20:53:14] robertsky: yes! [20:53:30] cjming: thanks for the help. :3 [20:53:43] you're so welcome! 
[20:54:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196520 (https://phabricator.wikimedia.org/T407183) (owner: 10Zoranzoki21) [20:54:58] (03Merged) 10jenkins-bot: Enable protection indicator for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196520 (https://phabricator.wikimedia.org/T407183) (owner: 10Zoranzoki21) [20:55:29] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1196520|Enable protection indicator for srwiki (T407183)]] [20:55:33] T407183: Enable protection indicator for srwiki - https://phabricator.wikimedia.org/T407183 [20:57:52] !log cjming@deploy2002 cjming, zoranzoki21: Backport for [[gerrit:1196520|Enable protection indicator for srwiki (T407183)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:57:57] Checking... [20:58:53] ty [20:59:33] cjming: LGTM! [20:59:42] yay! syncing [20:59:46] !log cjming@deploy2002 cjming, zoranzoki21: Continuing with sync [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T2100) [21:00:58] just wrapping up the late UTC backport window here soon - last config patch should finish in a few minutes [21:03:43] !log adding additional disk space to cloudbackup1002-dev with "sudo gnt-instance modify --disk add:size=60g cloudbackup1002-dev.eqiad.wmnet" [21:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:54] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196520|Enable protection indicator for srwiki (T407183)]] (duration: 08m 25s) [21:03:58] T407183: Enable protection indicator for srwiki - https://phabricator.wikimedia.org/T407183 [21:04:22] Looks good cjming. Thank you so much for your help! [21:04:43] Kizule: yw! 
should be live now [21:05:08] !log end of UTC late backport window [21:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:12] It is, thanks! [21:10:55] (03PS1) 10Bvibber: This copies .23's revert of the _broken version_ of the CORS image load fix! Production should work fine without it, but the broken version breaks things worse than the original bug. -bv [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196526 [21:15:19] (03CR) 10Kimberly Sarabia: [C:03+1] This copies .23's revert of the _broken version_ of the CORS image load fix! Production should work fine without it, but the broken version [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196526 (owner: 10Bvibber) [21:18:25] if nobody minds i'm gonna sneak in that backport :D [21:21:09] (03CR) 10Bvibber: [C:03+2] "self-deploying this as a quick fix" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196526 (owner: 10Bvibber) [21:21:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196526 (owner: 10Bvibber) [21:22:05] (03Merged) 10jenkins-bot: This copies .23's revert of the _broken version_ of the CORS image load fix! Production should work fine without it, but the broken version breaks things worse than the original bug. -bv [extensions/ReaderExperiments] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1196526 (owner: 10Bvibber) [21:22:18] i made that revert message too long by mistake lol [21:22:38] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1196526|This copies .23's revert of the _broken version_ of the CORS image load fix! Production should work fine without it, but the broken version breaks things worse than the original bug. 
-bv]] [21:24:56] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1196526|This copies .23's revert of the _broken version_ of the CORS image load fix! Production should work fine without it, but the broken version breaks things worse than the original bug. -bv]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:25:45] !log bvibber@deploy2002 bvibber: Continuing with sync [21:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:51] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196526|This copies .23's revert of the _broken version_ of the CORS image load fix! Production should work fine without it, but the broken version breaks things worse than the original bug. 
-bv]] (duration: 07m 13s) [21:30:57] done [21:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:35:40] !log andrew@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cloudbackup1002-dev.eqiad.wmnet [21:51:53] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:58:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724#11279385 (10Andrew) I currently can't run 'gnt-instance console' in eqiad1/ganeti1048 because of host key issues. Foolishly I trie... 
[21:58:58] (03PS1) 10Hamish: Create "autopatrolled" user group on Danish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196531 (https://phabricator.wikimedia.org/T407281) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251015T2200) [22:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:33:56] (03PS4) 10Andrea Denisse: alertmanager: Add Slack route for the rweb team [puppet] - 10https://gerrit.wikimedia.org/r/1196533 (https://phabricator.wikimedia.org/T406689) [22:42:20] (03CR) 10SimmeD: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196531 (https://phabricator.wikimedia.org/T407281) (owner: 10Hamish) [22:56:01] !log andrew@cumin2002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM cloudbackup1002-dev.eqiad.wmnet [23:18:08] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11279617 (10VRiley-WMF) Juniper has created RMA id #: R200590693 for this replacement. 
[23:29:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aqs1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [23:32:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:36:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [23:38:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1196540 [23:38:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1196540 (owner: 10TrainBranchBot) [23:48:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [23:49:23] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:50:17] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1196540 (owner: 10TrainBranchBot) [23:51:53] RESOLVED: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown