[00:00:03] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313 (owner: TrainBranchBot)
[00:08:08] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167714
[00:08:08] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167714 (owner: TrainBranchBot)
[00:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:12:09] (PS1) Ssingh: team-traffic: dnsbox: alert after rule is true for 1m [alerts] - https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619)
[00:13:19] (CR) CI reject: [V:-1] team-traffic: dnsbox: alert after rule is true for 1m [alerts] - https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[00:22:41] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1165832/6229/" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:22:51] (CR) Dzahn: [V:+1 C:+2] gerrit: config replicas for rename-project plugin [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:32:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167714 (owner: TrainBranchBot)
[00:32:32] (CR) Dzahn: [V:+1 C:+2] "all 3 servers have the new firewall rule and the gerrit config change. I did a service restart on gerrit2002 (to verify there is no syntax" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:35:18] (PS5) Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833)
[00:36:37] (CR) Dzahn: "changed "standby" to "spare" host to address concerns about confusing naming" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[00:36:59] (CR) Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[00:39:45] (CR) Dzahn: "the linked task is currently stalled and we have agreed to only do this once we have a real decision there. so this code change is also st" [dns] - https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: Dzahn)
[00:41:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:41:52] (Abandoned) Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[00:42:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:40] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[00:45:41] (CR) Dzahn: [V:+1 C:+2] "If we agree on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129920 and once gerrit2002 is down.. I would then make a new patch to" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:48:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:12] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:50:24] (CR) Dzahn: [V:+1 C:+2] "same for the replica settings. once we drop gerrit2002 and only gerrit2003 is left we can replace the host name string with the replica_ho" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:57:36] RECOVERY - Disk space on an-worker1082 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1082&var-datasource=eqiad+prometheus/ops
[01:39:27] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[01:53:00] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[01:53:30] !log andrew@cumin1003 START - Cookbook sre.hosts.dhcp for host cloudcephosd1006.eqiad.wmnet
[01:55:03] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudcephosd1006.eqiad.wmnet
[02:03:50] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:03:56] !log root@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:03:59] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:04:03] !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:04:27] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:11:18] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:28:51] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:29:07] !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:29:43] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:37:27] !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:39:23] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:46:47] !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:53:22] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:01:35] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:01:50] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:15:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:17:49] andrew@cumin1003 reimage (PID 1068697) is awaiting input
[03:30:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[03:35:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[03:36:49] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:37:12] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:50:44] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990315 (VRiley-WMF) @Marostegui I have carved up the RAID into a RAID 10 and reimaged these servers. Would you be able to check it to see if it works out for you?
[03:55:25] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage
[03:58:40] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage
[04:10:03] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[04:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:51] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[04:13:10] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[04:14:17] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[04:16:19] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[04:16:28] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[04:17:20] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[04:18:01] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[04:29:11] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[04:32:45] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[04:32:56] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[04:46:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:51:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:55:56] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[04:56:42] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[04:56:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:57:49] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[04:58:36] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[05:01:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:11:55] (PS1) Giuseppe Lavagetto: Bugfixes for dependents, rename [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167732
[05:17:26] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfixes for dependents, rename [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167732 (owner: Giuseppe Lavagetto)
[05:21:10] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:21:42] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1003"
[05:21:44] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1003
[05:22:18] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1003
[05:22:19] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1003"
[05:25:20] (CR) KartikMistry: [C:+2] machinetranslation: staging: Update MinT to 2025-07-09-124154-production [deployment-charts] - https://gerrit.wikimedia.org/r/1167608 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[05:26:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:27:02] (Merged) jenkins-bot: machinetranslation: staging: Update MinT to 2025-07-09-124154-production [deployment-charts] - https://gerrit.wikimedia.org/r/1167608 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[05:31:24] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 252.08 ms
[05:31:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:38:08] Quick deploy of MinT on staging..
[05:38:23] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:54:19] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0600)
[06:00:05] marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0600)
[06:09:48] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2001.codfw.wmnet, wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:10:48] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:12:59] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10990367 (elukey) @Jclark-ctr IIUC it was a temporary failure right?
[06:14:48] SRE-swift-storage, MinT, LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10990368 (KartikMistry) Status update: We're testing the `entrypoint.sh` in the staging (using `values-staging.yaml`). Currentl...
[06:21:14] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990381 (Marostegui) Looking good on both hosts @VRiley-WMF! Thank you so much! ` VD LIST : ======= --------------------------------------------------------------- DG/VD TYPE...
[06:22:02] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990382 (Marostegui) Open→Resolved
[06:25:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[06:28:37] (PS1) Muehlenhoff: Record LDAP access of vpm [puppet] - https://gerrit.wikimedia.org/r/1167735
[06:30:02] (CR) Muehlenhoff: [C:+2] Record LDAP access of vpm [puppet] - https://gerrit.wikimedia.org/r/1167735 (owner: Muehlenhoff)
[06:35:31] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2228.codfw.wmnet with reason: Maintenance
[06:35:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2228 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78846 and previous config saved to /var/cache/conftool/dbconfig/20250710-063535-marostegui.json
[06:36:15] (PS1) Marostegui: db2228: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167736 (https://phabricator.wikimedia.org/T398928)
[06:36:17] (PS3) Muehlenhoff: memcached::instance: Remove support for Ferm-only syntax [puppet] - https://gerrit.wikimedia.org/r/1161511
[06:36:44] (CR) Marostegui: [C:+2] db2228: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167736 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:37:06] (CR) Arnaudb: [C:+1] "this is one of those changes where the context is way longer than the change 😄 thanks @dzahn@wikimedia.org it looks good to me" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[06:39:05] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2228.codfw.wmnet with reason: Maintenance
[06:43:38] (CR) Muehlenhoff: [C:+2] thumbor: Update service image to latest rebuild [deployment-charts] - https://gerrit.wikimedia.org/r/1167240 (owner: Muehlenhoff)
[06:44:11] (PS1) Marostegui: db1210: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167737 (https://phabricator.wikimedia.org/T398928)
[06:44:38] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[06:44:46] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[06:44:55] (CR) Marostegui: [C:+2] db1210: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167737 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:45:55] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[06:45:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1210 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78847 and previous config saved to /var/cache/conftool/dbconfig/20250710-064558-marostegui.json
[06:46:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78848 and previous config saved to /var/cache/conftool/dbconfig/20250710-064605-root.json
[06:47:08] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[06:49:10] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[06:50:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161511 (owner: Muehlenhoff)
[06:52:08] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and A:cp - 2.8.15 upgrade (T398720)
[06:52:11] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[06:52:38] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[06:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78849 and previous config saved to /var/cache/conftool/dbconfig/20250710-065350-root.json
[06:55:37] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[06:58:30] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[06:58:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[06:59:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:59:26] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and A:cp - 2.8.15 upgrade (T398720)
[06:59:29] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[07:00:04] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:39] !log installing libbpf security updates
[07:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78851 and previous config saved to /var/cache/conftool/dbconfig/20250710-070111-root.json
[07:03:40] (PS1) Muehlenhoff: Add library hint for libbpf [puppet] - https://gerrit.wikimedia.org/r/1167738
[07:06:21] (CR) Muehlenhoff: [C:+2] Add library hint for libbpf [puppet] - https://gerrit.wikimedia.org/r/1167738 (owner: Muehlenhoff)
[07:08:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78852 and previous config saved to /var/cache/conftool/dbconfig/20250710-070855-root.json
[07:10:28] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[07:15:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78853 and previous config saved to /var/cache/conftool/dbconfig/20250710-071616-root.json
[07:17:11] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152 (Addshore) NEW
[07:17:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[07:18:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Maintenance
[07:21:21] haproxy alerts will be expected
[07:21:57] (PS1) Marostegui: mariadb: Move db1213 to m1 [puppet] - https://gerrit.wikimedia.org/r/1167741 (https://phabricator.wikimedia.org/T399060)
[07:22:03] (PS1) Elukey: machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491)
[07:22:26] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:22:28] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:23:23] (CR) CI reject: [V:-1] machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491) (owner: Elukey)
[07:23:27] (PS2) Marostegui: mariadb: Move db1213 to m1 [puppet] - https://gerrit.wikimedia.org/r/1167741 (https://phabricator.wikimedia.org/T399060)
[07:24:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78855 and previous config saved to /var/cache/conftool/dbconfig/20250710-072401-root.json
[07:25:03] (CR) Marostegui: [C:+2] mariadb: Move db1213 to m1 [puppet] - https://gerrit.wikimedia.org/r/1167741 (https://phabricator.wikimedia.org/T399060) (owner: Marostegui)
[07:25:48] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:28:15] (PS2) Elukey: machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491)
[07:29:09] !log Restarting CI Jenkins
[07:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78856 and previous config saved to /var/cache/conftool/dbconfig/20250710-073123-root.json
[07:36:26] (CR) Elukey: [C:+2] machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491) (owner: Elukey)
[07:39:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78857 and previous config saved to /var/cache/conftool/dbconfig/20250710-073907-root.json
[07:39:46] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and A:cp - 2.8.15 upgrade (T398720)
[07:39:49] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[07:39:51] (CR) Vgutierrez: [C:+1] hiera: disable OCSP for GTS certs [puppet] - https://gerrit.wikimedia.org/r/1167687 (https://phabricator.wikimedia.org/T399079) (owner: Ssingh)
[07:39:57] FIRING: CertAlmostExpired: Certificate for service lsw1-d1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-d1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:40:40] (CR) Vgutierrez: [C:+1] P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[07:41:47] (CR) Vgutierrez: [C:+1] nagios_common: remove check_ssl_cdn_ocsp* [puppet] - https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[07:43:03] (PS1) Marostegui: db2178: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167743 (https://phabricator.wikimedia.org/T398928)
[07:43:12] (CR) Vgutierrez: [C:+1] "brett, FYI icinga checks get applied on alert hosts, so PCC would need to include alert1002.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[07:43:37] (CR) Marostegui: [C:+2] db2178: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167743 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[07:44:25] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and A:cp - 2.8.15 upgrade (T398720)
[07:44:29] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance
[07:44:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2178 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78858 and previous config saved to /var/cache/conftool/dbconfig/20250710-074432-marostegui.json
[07:44:54] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: sync
[07:45:40] (CR) Vgutierrez: [C:+2] hiera: Switch to upload cert on upload cluster [puppet] - https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[07:47:28] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync
[07:50:59] !log switching to upload cert globally on upload CDN cluster - T394484
[07:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:02] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484
[07:52:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78859 and previous config saved to /var/cache/conftool/dbconfig/20250710-075202-root.json
[07:54:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1047.eqiad.wmnet with reason: Maintenance
[07:55:09] (CR) Slyngshede: [C:+2] Netbox: add limit to rate [alerts] - https://gerrit.wikimedia.org/r/1167633 (owner: Slyngshede)
[07:55:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[07:57:24] (Merged) jenkins-bot: Netbox: add limit to rate [alerts] - https://gerrit.wikimedia.org/r/1167633 (owner: Slyngshede)
[07:59:42] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[07:59:59] (CR) David Caro: Cloudcephosd1048: Configure ceph with a single nic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[08:00:05] andre and jnuche: That opportune time for a MediaWiki train - Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0800).
[08:00:45] !log installing python-urllib3 security updates
[08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:28] o/
[08:02:47] (PS1) TrainBranchBot: group2 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167819 (https://phabricator.wikimedia.org/T392179)
[08:02:48] (CR) TrainBranchBot: [C:+2] group2 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167819 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:03:28] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:30] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:41] (Merged) jenkins-bot: group2 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167819 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:05:52] !log Depooling Liftwing prod in codfw so we can roll out some changes that restart all services (cf. T398533)
[08:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:56] T398533: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533
[08:06:52] (CR) David Caro: "The pcc LGTM, just a note on the datastructure there" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[08:07:00] !log klausman@cumin1002 conftool action : get/pooled; selector: dnsdisc=inference,name=codfw
[08:07:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78860 and previous config saved to /var/cache/conftool/dbconfig/20250710-080708-root.json
[08:07:41] !log klausman@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=inference,name=codfw
[08:09:47] sre-alert-triage, serviceops: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158 (LSobanski) NEW
[08:10:14] !log installing containerd security updates
[08:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:17] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: ProbeDown (instance data-gateway-staging:30443) - https://phabricator.wikimedia.org/T399159 (LSobanski) NEW
[08:10:28] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: ProbeDown (instance data-gateway-staging:30443) - https://phabricator.wikimedia.org/T399159#10990844 (LSobanski) The alert is firing for both eqiad and codfw.
[08:10:59] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T399160 (10LSobanski) 03NEW [08:11:28] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.9 refs T392179 [08:11:32] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [08:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:12:13] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 2.8.15 upgrade (T398720) [08:12:16] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [08:15:07] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 2.8.15 upgrade (T398720) [08:16:09] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161 (10LSobanski) 03NEW [08:21:58] (03CR) 10Muehlenhoff: [C:03+2] memcached::instance: Remove support for Ferm-only syntax [puppet] - 10https://gerrit.wikimedia.org/r/1161511 (owner: 10Muehlenhoff) [08:22:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78861 and previous config saved to /var/cache/conftool/dbconfig/20250710-082213-root.json [08:26:37] (03CR) 10Muehlenhoff: [C:03+2] httpbb: Rebuild for Bookworm [software/httpbb] - 10https://gerrit.wikimedia.org/r/1146585 (https://phabricator.wikimedia.org/T393711) (owner: 10Muehlenhoff) [08:27:12] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for hiddenparma [puppet] - 
10https://gerrit.wikimedia.org/r/1092195 (https://phabricator.wikimedia.org/T135991) [08:30:04] (03PS2) 10Muehlenhoff: mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1037766 [08:30:08] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [08:30:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037766 (owner: 10Muehlenhoff) [08:31:18] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [08:33:15] (03PS5) 10Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1098556 [08:37:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78863 and previous config saved to /var/cache/conftool/dbconfig/20250710-083719-root.json [08:40:06] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync [08:40:31] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [08:41:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [08:41:31] (03CR) 10Elukey: "Left some nits and high level comments, the work is really great and the new functionality is what we need. 
I didn't test the command but " [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [08:45:18] !log installing setuptools security updates [08:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:58] 06SRE, 06collaboration-services, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10990931 (10ABran-WMF) a:03ABran-WMF [08:51:17] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 2.8.15 upgrade (T398720) [08:51:22] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [08:53:05] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 2.8.15 upgrade (T398720) [08:59:29] (03CR) 10Filippo Giunchedi: "Thank you and see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) (owner: 10Jcrespo) [09:02:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:02:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:03:17] (03PS1) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) [09:12:00] (03PS2) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) [09:12:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Update db2240 T397163', diff saved to https://phabricator.wikimedia.org/P78865 and previous config saved to /var/cache/conftool/dbconfig/20250710-091250-fceratto.json [09:12:55] T397163: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T397163 [09:13:12] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6231/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:14:07] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2161 gradually with 4 steps - Pooling in [09:14:11] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2161 gradually with 4 steps - Pooling in [09:14:47] (03PS1) 10Klausman: httpbb: Add missing machinery to deplot article-models test file [puppet] - 10https://gerrit.wikimedia.org/r/1167827 [09:14:57] FIRING: [2x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:15:06] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2240 gradually with 4 steps - Pooling in [09:15:10] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2240 gradually with 4 steps - Pooling in [09:16:24] (03CR) 10AikoChou: [C:03+1] httpbb: Add missing machinery to deplot article-models test file [puppet] - 10https://gerrit.wikimedia.org/r/1167827 (owner: 10Klausman) [09:19:55] 
(03CR) 10Volans: "It's removing the `--filter-file /etc/docker-report/k8s_registry_rules.ini` from the registry one, expected?" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:19:57] FIRING: [3x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:21:47] (03PS2) 10Klausman: httpbb: Add missing machinery to deploy some tests [puppet] - 10https://gerrit.wikimedia.org/r/1167827 [09:24:00] (03CR) 10AikoChou: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1167827 (owner: 10Klausman) [09:24:07] (03PS3) 10Hashar: Use thirdparty/jenkins on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) [09:24:08] (03CR) 10Hashar: "Moritz asked for the rename in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137361/comments/e85643f6_01ddab1a" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar) [09:24:11] (03CR) 10Klausman: [V:03+2 C:03+2] httpbb: Add missing machinery to deploy some tests [puppet] - 10https://gerrit.wikimedia.org/r/1167827 (owner: 10Klausman) [09:27:32] (03PS1) 10Hnowlan: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167828 [09:27:43] (03CR) 10Klausman: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6232/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167827 (owner: 10Klausman) [09:28:36] (03CR) 10Jgiannelos: "Needs version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167828 (owner: 10Hnowlan) [09:28:59] (03PS2) 10Hnowlan: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1167828 [09:31:26] RESOLVED: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:58] dd [09:32:10] >_> [09:32:35] (03PS3) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) [09:34:57] (03CR) 10Klausman: httpbb(liftwing): add edit-check tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [09:35:44] (03PS3) 10Hnowlan: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167828 [09:36:21] (03CR) 10Jgiannelos: [C:03+1] Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167828 (owner: 10Hnowlan) [09:36:58] (03CR) 10Clément Goubert: [C:03+2] "I reset the failed service, but it's not even supposed to try to start..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1166213 (owner: 10Clément Goubert) [09:39:05] (03CR) 10Hnowlan: [C:03+2] Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167828 (owner: 10Hnowlan) [09:40:58] (03Merged) 10jenkins-bot: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167828 (owner: 10Hnowlan) [09:43:22] !log installing initramfs-tools bugfix updates from Bookworm point release [09:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:04] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:44:32] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:45:38] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [09:45:49] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [09:56:29] (03PS1) 10Jgiannelos: changeprop: Fix file exclusion rule regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167830 [09:56:30] (03CR) 10Jcrespo: "Some comments about what to do next." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) (owner: 10Jcrespo) [09:56:59] (03PS4) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) [09:57:07] (03CR) 10Hnowlan: [C:03+1] changeprop: Fix file exclusion rule regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167830 (owner: 10Jgiannelos) [09:57:55] (03CR) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [10:00:00] (03CR) 10Jgiannelos: [C:03+2] changeprop: Fix file exclusion rule regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167830 (owner: 10Jgiannelos) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1000) [10:01:02] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399097#10991127 (10cmooney) So yeah this continued to bounce after that yesterday, eventually going hard down and remains so. ` Jul... [10:01:37] (03Merged) 10jenkins-bot: changeprop: Fix file exclusion rule regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167830 (owner: 10Jgiannelos) [10:02:04] (03PS6) 10Jcrespo: prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T350360) [10:04:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#10991132 (10Ladsgroup) As WMF sponsor. This request has my support. 
I don't know what the policy is these days but if it needs a staff sponsor, it has mine [10:04:53] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:05:02] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:05:08] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:05:19] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:05:23] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:05:37] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:05:42] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:05:48] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:15:30] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10991171 (10MoritzMuehlenhoff) [10:15:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10991173 (10MoritzMuehlenhoff) [10:24:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[10:24:44] Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [10:33:16] !log kafka preferred-replica-election on kafka-main2010 [10:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:57] FIRING: [5x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:39:57] FIRING: [7x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:44:57] FIRING: [9x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:45:10] jelto@cumin1003 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[10:46:23] (03PS1) 10Marostegui: mariadb: Change backups host [puppet] - 10https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) [10:46:47] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: 10Marostegui) [10:48:10] jelto@cumin1003 upgrade (PID 1090856) is awaiting input [10:49:57] FIRING: [11x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:50:39] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3646 MB (3% inode=98%): /tmp 3646 MB (3% inode=98%): /var/tmp 3646 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [10:54:57] FIRING: [13x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:52] (03CR) 10Jcrespo: [C:03+1] "Please let me know when the switchover happens (can be after the fact, don't depend on me) to make sure service is ok after it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: 10Marostegui) [10:56:25] (03CR) 10Marostegui: [C:04-2] "Will do, aiming for Monday morning" [puppet] - 10https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: 10Marostegui) [10:57:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:58:32] (03PS1) 10Jgiannelos: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167834 [10:58:50] (03PS2) 10Jgiannelos: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167834 [10:58:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:59:57] FIRING: [15x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:03:26] (03PS3) 10Jgiannelos: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) [11:04:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1039', diff saved to https://phabricator.wikimedia.org/P78867 and previous config saved to /var/cache/conftool/dbconfig/20250710-110408-marostegui.json [11:04:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1039.eqiad.wmnet with reason: Maintenance [11:04:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[11:04:44] Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [11:04:57] FIRING: [17x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:06:11] (03CR) 10Clément Goubert: [C:03+1] hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [11:06:36] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [11:06:46] (03CR) 10Clément Goubert: [C:03+1] wikimedia: add CNAMEs for hcaptcha domains [dns] - 10https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan) [11:06:58] (03CR) 10Hnowlan: [C:03+1] changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:09:27] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 2.8.15 upgrade (T398720) [11:09:31] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [11:09:33] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6233/co" [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 
10Ilias Sarantopoulos) [11:09:38] (03CR) 10Clément Goubert: [C:03+1] trafficserver, cache: add config for edge routing of hcaptcha [puppet] - 10https://gerrit.wikimedia.org/r/1167670 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan) [11:09:57] FIRING: [19x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:09:59] (03CR) 10Klausman: [V:03+1 C:03+2] httpbb(liftwing): add edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [11:14:13] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 2.8.15 upgrade (T398720) [11:14:57] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:15:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:16:51] (03PS3) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) [11:18:05] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6234/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [11:19:38] 07sre-alert-triage, 10Maps, 06serviceops: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#10991406 
(10Clement_Goubert) This is on a `maps-test` server, maybe alert severity should be brought down. Anyhow, tagging #maps project for follo... [11:21:25] (03CR) 10Elukey: "Should be fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [11:21:51] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1007.eqiad.wmnet'] [11:24:04] (03PS1) 10Klausman: httpbb: drop extraneous `files/` path element [puppet] - 10https://gerrit.wikimedia.org/r/1167836 [11:24:18] (03CR) 10Klausman: [V:03+2 C:03+2] httpbb: drop extraneous `files/` path element [puppet] - 10https://gerrit.wikimedia.org/r/1167836 (owner: 10Klausman) [11:25:06] (03CR) 10Jgiannelos: [C:03+2] changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:27:19] (03Merged) 10jenkins-bot: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:29:40] !log andrew@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1007.eqiad.wmnet'] [11:30:36] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [11:30:38] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180 (10cmooney) 03NEW p:05Triage→03Medium [11:30:43] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bookworm [11:30:48] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [11:31:44] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - 
https://phabricator.wikimedia.org/T319184#10991485 (10cmooney) 05Stalled→03Resolved a:03cmooney I am going to close this one (please ping me if that is hasty!) as I've o... [11:33:17] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [11:33:55] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991500 (10cmooney) [11:34:44] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10991503 (10Tobi_WMDE_SW) >>! In T398686#10985976, @Dzahn wrote: > @Tobi_WMDE_SW and/or @sowmya.guru, is this request only to add a new approver or is it _also_ for access for... [11:34:58] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:35:12] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:35:17] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:35:33] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:39:19] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [11:41:18] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1007.eqiad.wmnet with OS bookworm [11:44:28] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bookworm [11:46:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis mediawikiwiki, testwiki in section s5 [11:46:36] (03PS1) 10Marostegui: db1200: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167837 (https://phabricator.wikimedia.org/T398928) [11:47:12] (03CR) 10Marostegui: [C:03+2] db1200: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167837 (https://phabricator.wikimedia.org/T398928) (owner: 
10Marostegui) [11:47:36] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1200.eqiad.wmnet with reason: Maintenance [11:47:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1200 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78869 and previous config saved to /var/cache/conftool/dbconfig/20250710-114739-marostegui.json [11:48:59] (03PS1) 10Arnaudb: gerrit: enable gerrit.service and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804) [11:49:43] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1007.eqiad.wmnet with OS bookworm [11:50:00] (03PS2) 10Arnaudb: gerrit: enable gerrit.service and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804) [11:51:31] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis mediawikiwiki, testwiki in section s5 [11:51:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [11:52:02] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis mediawikiwiki, testwiki in section s5 [11:52:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10991577 (10cmooney) I created the below task to continue the discussion of how we set up the interfaces for these hosts, and cop... 
[11:53:16] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bookworm [11:53:34] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 2.8.15 upgrade (T398720) [11:53:37] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [11:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78870 and previous config saved to /var/cache/conftool/dbconfig/20250710-115534-root.json [11:56:23] fceratto@cumin1002 sanitize-wiki (PID 1181071) is awaiting input [11:56:59] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 2.8.15 upgrade (T398720) [11:59:59] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 2.8.15 upgrade (T398720) [12:00:02] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1200) [12:00:47] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991587 (10cmooney) [12:01:47] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 2.8.15 upgrade (T398720) [12:01:57] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991590 (10cmooney) [12:02:44] fceratto@cumin1002 sanitize-wiki (PID 1181071) is awaiting input [12:06:18] (03CR) 10AikoChou: [C:03+2] "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [12:06:18] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis mediawikiwiki, testwiki in section s5 [12:07:51] (03Merged) 10jenkins-bot: ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [12:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78871 and previous config saved to /var/cache/conftool/dbconfig/20250710-121039-root.json [12:11:24] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage [12:14:02] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage [12:15:48] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis mediawikiwiki, testwiki in section s3 [12:17:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:56] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=97) Managing sanitization for wikis mediawikiwiki, testwiki in section s3 [12:18:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:24:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:25:11] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:25:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78872 and previous config saved to /var/cache/conftool/dbconfig/20250710-122545-root.json [12:27:07] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:09] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 4.513 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.433 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:30:39] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3616 MB (3% inode=98%): /tmp 3616 MB (3% inode=98%): /var/tmp 3616 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [12:32:17] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:32:45] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bookworm [12:35:23] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 2.8.15 upgrade (T398720) [12:35:27] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [12:35:42] (03PS1) 10Marostegui: db2171: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167852 (https://phabricator.wikimedia.org/T398928) [12:37:31] (03CR) 10Marostegui: [C:03+2] db2171: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167852 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [12:38:06] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:38:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2171 for 
migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78873 and previous config saved to /var/cache/conftool/dbconfig/20250710-123809-marostegui.json [12:39:34] (03PS1) 10KartikMistry: machinetranslationt: Use s3 model storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491) [12:39:51] (03PS4) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) [12:39:53] (03CR) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:40:10] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 2.8.15 upgrade (T398720) [12:40:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78874 and previous config saved to /var/cache/conftool/dbconfig/20250710-124051-root.json [12:42:17] (03PS1) 10Michael Große: fix(StructuredTask): wrong order in resolving a deferred [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 [12:42:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:44:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 (owner: 10Michael Große) [12:45:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling 
@ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78875 and previous config saved to /var/cache/conftool/dbconfig/20250710-124530-root.json [12:48:20] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:48:58] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:49:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar) [12:52:31] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:52:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:54:27] (03PS1) 10AikoChou: httpbb(liftwing): update edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1167858 (https://phabricator.wikimedia.org/T397013) [12:54:32] (03CR) 10Volans: [C:03+1] "LGTM, a question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [12:57:35] (03CR) 10Muehlenhoff: [C:03+2] Use thirdparty/jenkins on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar) [12:58:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:35] (03CR) 10Klausman: [C:03+1] httpbb(liftwing): update edit-check tests [puppet] - 
10https://gerrit.wikimedia.org/r/1167858 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [12:59:29] !log klausman@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:00:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:28] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300). Please do the needful. [13:00:28] MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:35] * MichaelG_WMF is here [13:00:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78876 and previous config saved to /var/cache/conftool/dbconfig/20250710-130036-root.json [13:04:02] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991850 (10cmooney) [13:06:12] !log installing ICU security updates [13:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:58] jouncebot: nowandnext [13:06:58] For the next 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300) [13:06:58] In 1 hour(s) and 23 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1430) [13:07:21] @moritzm do these security updates affect MediaWiki backports? 
[13:08:06] !log klausman@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:08:35] (03CR) 10Hashar: "releases2003 (Bookworm) now has `thirdparty/jenkins` in `/etc/apt/sources.list.d/thirdparty-jenkins.sources`." [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar) [13:10:39] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3533 MB (3% inode=98%): /tmp 3533 MB (3% inode=98%): /var/tmp 3533 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:15:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78878 and previous config saved to /var/cache/conftool/dbconfig/20250710-131541-root.json [13:17:29] (03Abandoned) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [13:18:22] MichaelG_WMF: no, these are unrelated to the current mediawiki deployments [13:18:23] (03PS2) 10Arnaudb: gerrit: enable monitoring for other instances [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) [13:18:23] (03CR) 10Arnaudb: "@jwodstrcil@wikimedia.org highlighted a missing scraping from our current config in https://phabricator.wikimedia.org/T398854#10991075 thi" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [13:18:29] (03CR) 10Hashar: "I have updated the reprepro documentation at https://wikitech.wikimedia.org/wiki/Jenkins#Get_the_package :)" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar) [13:19:14] @moritzm ack, thanks for confirming! 
[13:20:38] (03PS10) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [13:21:26] (03PS2) 10Ssingh: team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) [13:26:09] (03PS1) 10Vgutierrez: hiera: Point purged@eqsin to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1167860 [13:26:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167860 (owner: 10Vgutierrez) [13:27:29] (03CR) 10Filippo Giunchedi: [C:03+1] team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:27:55] (03CR) 10Ssingh: [C:03+2] team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:29:03] (03Merged) 10jenkins-bot: team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:29:25] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: disable OCSP for GTS certs [puppet] - 10https://gerrit.wikimedia.org/r/1167687 (https://phabricator.wikimedia.org/T399079) (owner: 10Ssingh) [13:29:42] (03CR) 10Fabfur: [C:03+1] hiera: Point purged@eqsin to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1167860 (owner: 10Vgutierrez) [13:29:57] (03CR) 10Vgutierrez: [C:03+2] hiera: Point purged@eqsin to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1167860 (owner: 10Vgutierrez) [13:30:01] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10991910 (10OKryva-WMF) I am Tran's EM. Approve the request. 
[13:30:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78879 and previous config saved to /var/cache/conftool/dbconfig/20250710-133047-root.json [13:30:48] (03PS1) 10David Caro: toolforge: rename the jobs-cli to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 [13:33:04] (03PS1) 10Marostegui: db2211: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167862 (https://phabricator.wikimedia.org/T398928) [13:33:30] (03PS2) 10David Caro: toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 [13:33:32] (03CR) 10Marostegui: [C:03+2] db2211: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167862 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [13:33:58] (03CR) 10CI reject: [V:04-1] toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 (owner: 10David Caro) [13:34:14] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2211.codfw.wmnet with reason: Maintenance [13:34:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2211 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78880 and previous config saved to /var/cache/conftool/dbconfig/20250710-133418-marostegui.json [13:34:57] <_joe_> jouncebot: now [13:34:57] For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300) [13:35:35] (03PS3) 10David Caro: toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 [13:36:14] _joe_: I will do MichaelG_WMF backport patch [13:36:24] hashar: thank you! [13:36:31] <_joe_> hashar: <3 [13:36:47] or did you have some urgent stuff to do on wikikube? 
[13:37:46] <_joe_> hashar: no, I was looking at what was in the calendar :) [13:37:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 (owner: 10Michael Große) [13:38:09] _joe_: great :) [13:38:21] (03CR) 10David Caro: [C:03+2] toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 (owner: 10David Caro) [13:38:34] (03CR) 10David Caro: [C:03+2] "This is reverting the latest deploys, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1167861 (owner: 10David Caro) [13:39:50] (03Merged) 10jenkins-bot: fix(StructuredTask): wrong order in resolving a deferred [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 (owner: 10Michael Große) [13:40:14] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1167856|fix(StructuredTask): wrong order in resolving a deferred]] [13:40:35] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [13:41:25] * MichaelG_WMF is ready to test with the debug extension whenever [13:41:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78881 and previous config saved to /var/cache/conftool/dbconfig/20250710-134150-root.json [13:42:17] !log hashar@deploy1003 migr, hashar: Backport for [[gerrit:1167856|fix(StructuredTask): wrong order in resolving a deferred]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:25] * MichaelG_WMF looks [13:42:46] (03PS1) 10Vgutierrez: Revert "hiera: Point purged@eqsin to main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1167864 [13:43:05] :) [13:45:03] hashar: I can confirm that this fixes the regression. 
Good to roll forward from my side 👍 [13:45:11] (03CR) 10Fabfur: [C:03+1] Revert "hiera: Point purged@eqsin to main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1167864 (owner: 10Vgutierrez) [13:45:57] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Point purged@eqsin to main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1167864 (owner: 10Vgutierrez) [13:46:03] !log hashar@deploy1003 migr, hashar: Continuing with sync [13:46:07] MichaelG_WMF: congratulations! [13:46:09] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS trixie [13:46:29] hashar: thank you for helping me out! 🙏 [13:46:47] !log upgrade spicerack on cumin2002 to 11.3.0 [13:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:29] !log klausman@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw [13:48:56] !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [13:49:50] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1008.eqiad.wmnet'] [13:49:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance [13:51:11] !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1002.eqiad.wmnet [13:51:25] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167856|fix(StructuredTask): wrong order in resolving a deferred]] (duration: 11m 10s) [13:52:36] MichaelG_WMF: the deploy is fully complete :] [13:52:38] jouncebot: now [13:52:38] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300) [13:52:47] !log UTC afternoon backport window completed [13:52:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:49] hashar: cool, thanks! [13:53:01] _joe_: the backport window has completed :) [13:53:31] (03PS1) 10Elukey: profile::docker::reporter: add DSE and AUX clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167868 (https://phabricator.wikimedia.org/T397696) [13:53:41] andrew@cumin1003 upgrade-firmware (PID 1134872) is awaiting input [13:54:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167868 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [13:55:47] jouncebot: nowandnext [13:55:47] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300) [13:55:47] In 0 hour(s) and 34 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1430) [13:56:20] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add DSE and AUX clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167868 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [13:56:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78882 and previous config saved to /var/cache/conftool/dbconfig/20250710-135656-root.json [13:57:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10992160 (10Jclark-ctr) >>! In T394333#10990367, @elukey wrote: > @Jclark-ctr IIUC it was a temporary failure right? yes that wa... 
[13:58:01] (03PS11) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [13:58:14] !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [13:58:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10992166 (10STran) [14:00:05] !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1002.eqiad.wmnet [14:01:47] !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:02:58] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [14:03:00] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:03:16] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [14:03:20] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [14:04:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10992180 (10Jclark-ctr) ml-serve1015 is now racked into E 12 and added to netbox @elukey Let me know when you’re finished with any testing you want to do. I’ll stay... 
[14:04:57] !log andrew@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1008.eqiad.wmnet'] [14:08:50] (03CR) 10JHathaway: "thanks volans, fixes applied" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [14:10:39] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3486 MB (3% inode=98%): /tmp 3486 MB (3% inode=98%): /var/tmp 3486 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [14:11:16] (03PS3) 10Ssingh: P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) [14:11:27] (03CR) 10Ssingh: "rebased, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [14:12:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78883 and previous config saved to /var/cache/conftool/dbconfig/20250710-141202-root.json [14:12:36] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [14:15:21] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [14:15:42] (03PS1) 10Jelto: aptrepo: add gitlab package for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1167871 (https://phabricator.wikimedia.org/T384595) [14:16:24] (03CR) 10Btullis: [C:03+2] analytics: deprioritize druid MapReduce jobs if needed [puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo) [14:16:26] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bookworm 
[14:17:52] (03PS2) 10Btullis: analytics: Absent rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [14:17:55] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [14:18:03] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6235/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [14:20:04] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [14:21:10] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804) (owner: 10Arnaudb) [14:22:09] (03PS10) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [14:23:00] (03CR) 10Arnaudb: [C:03+2] gerrit: enable gerrit.service and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804) (owner: 10Arnaudb) [14:24:24] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [14:26:40] (03PS1) 10Elukey: preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) [14:27:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78884 and previous config saved to /var/cache/conftool/dbconfig/20250710-142707-root.json [14:29:20] (03PS11) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [14:29:44] (03CR) 10CI reject: 
[V:04-1] preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1430) [14:30:15] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [14:30:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:30:46] (03CR) 10Jelto: "one question in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [14:31:00] (03PS2) 10Elukey: preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) [14:31:01] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:31:46] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [14:32:18] (03CR) 10Btullis: [C:03+2] analytics: Absent rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [14:33:10] (03PS3) 10Elukey: preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) [14:33:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:33:35] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [14:33:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [14:34:08] RECOVERY - mailman archives on lists1004 is 
OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:34:37] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage [14:34:57] (03CR) 10Volans: [C:03+1] "LGTM (requires spicerack to be released to all host before merging it)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [14:35:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1167871 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [14:37:10] (03CR) 10Elukey: [C:03+2] preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [14:38:23] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage [14:40:33] (03CR) 10Hnowlan: [C:03+2] hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [14:41:44] !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: no reason specified, no task ID specified] [14:41:47] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: no reason specified, no task ID specified] [14:45:20] (03PS1) 10Tiziano Fogli: nrpe::mon_srv: propagate NRPE migration_task to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/1167876 (https://phabricator.wikimedia.org/T359443) [14:54:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [14:54:33] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [14:56:16] !log andrew@cumin1003 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bookworm [14:56:33] (03PS1) 10Btullis: Set 52 Hadoop nodes into decommissioning state [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) [14:59:32] (03PS2) 10Btullis: Set 52 Hadoop nodes into decommissioning state [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) [15:00:05] andre and jnuche: Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1500). Please do the needful. [15:00:20] jouncebot: I don't think there's much to triage [15:00:40] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6236/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [15:00:45] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [15:01:56] (03PS1) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) [15:02:49] (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) (owner: 10Daimona Eaytoy) [15:03:28] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [15:05:02] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10992532 (10KartikMistry) Update: We've now staging server running using S3 model storage and observing logs... 
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:35] (03CR) 10Filippo Giunchedi: [C:03+1] nrpe::mon_srv: propagate NRPE migration_task to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/1167876 (https://phabricator.wikimedia.org/T359443) (owner: 10Tiziano Fogli) [15:09:53] (03CR) 10Filippo Giunchedi: "+1 to what Jelto said" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [15:11:06] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit2003.wikimedia.org with reason: maintenance [15:11:42] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10992557 (10elukey) relevant conversation from IRC: ` I think the root problem is that on sretest2006 /var/lib/partman/devices is empty, it's the file wh... 
[15:13:35] (PS2) Hnowlan: wikimedia: add CNAMEs for hcaptcha domains [dns] - https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841)
[15:13:50] (CR) Tiziano Fogli: [C:+2] nrpe::mon_srv: propagate NRPE migration_task to monitoring::service [puppet] - https://gerrit.wikimedia.org/r/1167876 (https://phabricator.wikimedia.org/T359443) (owner: Tiziano Fogli)
[15:14:38] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[15:15:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:15:08] (CR) Xcollazo: [C:+1] "It hurts a bit to see 52 of my friends leave, but LGTM." [puppet] - https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) (owner: Btullis)
[15:15:12] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:15:46] (PS1) JHathaway: reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - https://gerrit.wikimedia.org/r/1167883
[15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:58] (PS2) JHathaway: reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - https://gerrit.wikimedia.org/r/1167883
[15:17:56] (PS1) Elukey: profile::docker::reporter: add Wikikube and ML serve prod clusters [puppet] - https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696)
[15:18:10] !log
jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[15:18:44] elukey@cumin2002 reimage (PID 296806) is awaiting input
[15:20:22] SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10992587 (Dzahn) Gotcha, Tobi. Yea, seems no problem to do both in this ticket.
[15:20:28] (PS2) Elukey: profile::docker::reporter: add Wikikube and ML serve prod clusters [puppet] - https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696)
[15:20:57] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:21:06] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[15:21:55] (PS3) Hnowlan: wikimedia: add CNAMEs for hcaptcha domains [dns] - https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841)
[15:22:04] (PS3) Cwhite: logstash: move grafana status_code field to the right place [puppet] - https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565)
[15:22:30] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[15:23:44] (PS3) JHathaway: reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - https://gerrit.wikimedia.org/r/1167883
[15:23:53] (CR) Dzahn: "Will https monitoring actually work given that we currently get the "Forbidden" on https://gerrit-replica.wikimedia.org/r/monitoring ?"
[puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[15:24:00] (CR) Clément Goubert: [C:+1] wikimedia: add CNAMEs for hcaptcha domains [dns] - https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:25:28] !log upgrade spicerack to 11.3.0 on cumin100[2-3]
[15:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:53] (PS1) Volans: I/F: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167886
[15:25:53] (PS1) Volans: o11y: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167887
[15:25:53] (PS1) Volans: ServiceOps: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167888
[15:25:54] (PS1) Volans: Collab: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167889
[15:28:15] (CR) Hnowlan: [C:+2] wikimedia: add CNAMEs for hcaptcha domains [dns] - https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:29:40] !log hnowlan@dns1004 START - running authdns-update
[15:30:39] !log hnowlan@dns1004 END - running authdns-update
[15:31:52] (CR) CI reject: [V:-1] logstash: move grafana status_code field to the right place [puppet] - https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[15:32:55] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[15:33:58] (CR) Hnowlan: [C:+2] trafficserver, cache: add config for edge routing of hcaptcha [puppet] - https://gerrit.wikimedia.org/r/1167670 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:34:27] (CR) Cwhite: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[15:34:57] (PS1)
Daimona Eaytoy: Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T397926)
[15:35:21] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis mediawikiwiki, testwiki in section s3
[15:35:28] (CR) Volans: "More behavior options available in the commit message." [cookbooks] - https://gerrit.wikimedia.org/r/1167887 (owner: Volans)
[15:36:10] (CR) Volans: "More behavior options available in the commit message." [cookbooks] - https://gerrit.wikimedia.org/r/1167886 (owner: Volans)
[15:36:20] (CR) CI reject: [V:-1] Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T397926) (owner: Daimona Eaytoy)
[15:37:22] (CR) Volans: "More behavior options available in the commit message." [cookbooks] - https://gerrit.wikimedia.org/r/1167888 (owner: Volans)
[15:37:48] (CR) Volans: "More behavior options available in the commit message." [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[15:38:51] (CR) Cwhite: [C:+2] logstash: move grafana status_code field to the right place [puppet] - https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[15:40:08] hnowlan: is the hcaptcha change ready for deploy?
[15:40:49] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[15:40:50] cwhite: yes, please!
[15:40:59] if you're already merging
[15:41:05] will do :)
[15:44:09] {done}
[15:44:34] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167883 (owner: JHathaway)
[15:44:34] (PS3) Cwhite: logstash: flatten array of objects in stack_trace [puppet] - https://gerrit.wikimedia.org/r/1164525 (https://phabricator.wikimedia.org/T234565)
[15:49:25] fceratto@cumin1002 sanitize-wiki (PID 1333657) is awaiting input
[15:49:41] (PS1) Hnowlan: wikimedia: simplify hcaptcha subsubdomains [dns] - https://gerrit.wikimedia.org/r/1167891 (https://phabricator.wikimedia.org/T397841)
[15:49:55] SRE, collaboration-services, Infrastructure-Foundations, Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10992657 (jhathaway) would it be possible to setup a separate vrts server, that is configured with postfix, rather than replacing exim...
[15:50:15] (PS7) Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910)
[15:51:00] (CR) Ssingh: [C:+1] wikimedia: simplify hcaptcha subsubdomains [dns] - https://gerrit.wikimedia.org/r/1167891 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:52:13] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[15:52:53] (CR) Cathal Mooney: "LGTM, one question hits me but I think the logic works.
Also if we do the conditional the way dcaro suggests is fine no preference on my " [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[15:54:25] !log refreshed YARN queues definition in production via https://phabricator.wikimedia.org/T399013#10992686
[15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:12] (PS1) Hnowlan: trafficserver, profile::hcaptcha: simplify subdomains [puppet] - https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841)
[15:56:26] (PS8) Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910)
[15:56:51] (CR) Hnowlan: [C:+2] wikimedia: simplify hcaptcha subsubdomains [dns] - https://gerrit.wikimedia.org/r/1167891 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:57:10] !log hnowlan@dns1004 START - running authdns-update
[15:58:01] !log hnowlan@dns1004 END - running authdns-update
[15:58:53] (CR) Alexandros Kosiaris: [C:+1] trafficserver, profile::hcaptcha: simplify subdomains [puppet] - https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:58:58] (CR) Clément Goubert: [C:+1] trafficserver, profile::hcaptcha: simplify subdomains [puppet] - https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[15:59:46] (PS1) Máté Szabó: Configure Special:CreateAccount instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744)
[16:00:05] jhathaway and moritzm: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1600)
[16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:00:31] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[16:00:53] (CR) Btullis: [V:+1 C:+2] Set 52 Hadoop nodes into decommissioning state [puppet] - https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) (owner: Btullis)
[16:02:38] (CR) Hnowlan: [C:+2] trafficserver, profile::hcaptcha: simplify subdomains [puppet] - https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[16:04:13] (PS1) Volans: Data Persistence: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167898
[16:05:04] (CR) Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[16:05:38] (CR) JHathaway: "tested a UUID reimage successfully for sretest2001, in combination with 1167883" [cookbooks] - https://gerrit.wikimedia.org/r/1164317 (owner: JHathaway)
[16:05:56] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10992727 (Jhancock.wm) @elukey i got 2044 pingable. i set a few things on this one, including the password, in the idrac. i also got 2045 pingable. on this one i only...
[16:07:15] (CR) Volans: "More behavior options available in the commit message." [cookbooks] - https://gerrit.wikimedia.org/r/1167898 (owner: Volans)
[16:09:08] SRE, Data-Engineering, LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10992745 (SKivlehan-WMF) In progress→Resolved I'm in! Marking as Resolved, thank you all for the assistance here.
[16:10:52] (CR) Jforrester: [C:+1] "Good plan."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T397926) (owner: Daimona Eaytoy)
[16:11:18] (CR) Jforrester: "00:00:19.370 1) InitialiseSettingsTest::testMustHaveConfigs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) (owner: Daimona Eaytoy)
[16:11:50] (PS1) FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212)
[16:12:17] (CR) CI reject: [V:-1] openstack: nova: Load nf_conntrack module at boot [puppet] - https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: FNegri)
[16:13:20] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[16:14:50] (PS1) Federico Ceratto: sanitize-wiki: Support sections other than s5 [cookbooks] - https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178)
[16:14:50] (CR) Federico Ceratto: "Allows setting sections other than s5" [cookbooks] - https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178) (owner: Federico Ceratto)
[16:15:18] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[16:15:24] (PS2) FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212)
[16:18:05] (CR) CI reject: [V:-1] openstack: nova: Load nf_conntrack module at boot [puppet] - https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: FNegri)
[16:20:43] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with
Dell SCP reboot policy FORCED
[16:21:51] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:22:07] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:22:43] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:23:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:24:40] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:26:14] (PS2) Jforrester: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) (owner: Daimona Eaytoy)
[16:26:14] (PS1) Jforrester: Explicitly set wgServer etc.
for private wikis under the 'wikipedia' dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T397926)
[16:39:55] (CR) Cwhite: [C:+2] logstash: flatten array of objects in stack_trace [puppet] - https://gerrit.wikimedia.org/r/1164525 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[16:42:49] (PS9) Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910)
[16:42:49] (PS1) Andrew Bogott: cloudceph osd.yaml: update some nic names for Bookworm reimages [puppet] - https://gerrit.wikimedia.org/r/1167905
[16:46:16] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1167905 (owner: Andrew Bogott)
[16:46:32] (CR) Cathal Mooney: [C:+1] "LGTM. I confirmed they are the names the system is now using." [puppet] - https://gerrit.wikimedia.org/r/1167905 (owner: Andrew Bogott)
[16:50:14] (CR) Andrew Bogott: [C:+2] cloudceph osd.yaml: update some nic names for Bookworm reimages [puppet] - https://gerrit.wikimedia.org/r/1167905 (owner: Andrew Bogott)
[16:50:43] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis mediawikiwiki, testwiki in section s3
[16:54:26] (PS2) BryanDavis: puppetserver: check for rebase in puppetserver-deploy-code [puppet] - https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877)
[16:55:47] (CR) BryanDavis: puppetserver: check for rebase in puppetserver-deploy-code (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: BryanDavis)
[17:00:05] bd808: #bothumor I � Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1700).
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1700)
[17:01:00] Nothing to push out in my window this week
[17:04:34] (CR) Marostegui: [C:+1] sanitize-wiki: Support sections other than s5 [cookbooks] - https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178) (owner: Federico Ceratto)
[17:05:20] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:07:10] (PS5) BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[17:09:58] (PS6) BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[17:10:32] (CR) Herron: [C:+1] "Note: We will need to manually clean the old pyrra configs that will be orphaned by this change" [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[17:12:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78886 and previous config saved to /var/cache/conftool/dbconfig/20250710-171214-root.json
[17:14:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:19:04] (CR) JHathaway: [C:+1] puppetserver: check for rebase in puppetserver-deploy-code [puppet] - https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: BryanDavis)
[17:19:19] (CR) Federico Ceratto: "LGTM, the small change in `--task` vs `--task-id`
should not be an issue, also afaik Manuel tends to use `-t` anyways." [cookbooks] - https://gerrit.wikimedia.org/r/1167898 (owner: Volans)
[17:19:47] (CR) Federico Ceratto: [C:+1] Data Persistence: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167898 (owner: Volans)
[17:21:43] (CR) Daimona Eaytoy: [C:+1] "Thank you!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T397926) (owner: Jforrester)
[17:21:59] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1049.eqiad.wmnet']
[17:22:17] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1049.eqiad.wmnet']
[17:25:23] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1049.eqiad.wmnet
[17:25:34] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1049.eqiad.wmnet
[17:27:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78887 and previous config saved to /var/cache/conftool/dbconfig/20250710-172719-root.json
[17:28:25] (PS2) Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: Jforrester)
[17:28:40] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:28:42] (PS3) Daimona Eaytoy: Explicitly set wgServer etc.
for private wikis under the 'wikipedia' dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: Jforrester)
[17:29:03] (PS4) Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: Jforrester)
[17:29:30] (PS3) Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[17:29:43] (PS2) Daimona Eaytoy: Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549)
[17:30:35] (CR) CI reject: [V:-1] Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: Daimona Eaytoy)
[17:33:03] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:33:29] (PS1) Daimona Eaytoy: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - https://gerrit.wikimedia.org/r/1167910
[17:35:00] (CR) Ssingh: [V:+1] "Given the incident today and in general, I will just merge this on Monday."
[puppet] - https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[17:39:40] (PS2) Daimona Eaytoy: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - https://gerrit.wikimedia.org/r/1167910
[17:39:52] (CR) CI reject: [V:-1] Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - https://gerrit.wikimedia.org/r/1167910 (owner: Daimona Eaytoy)
[17:42:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78889 and previous config saved to /var/cache/conftool/dbconfig/20250710-174225-root.json
[17:42:32] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:47:56] (CR) BryanDavis: [V:+1] "Seems to be working as hoped to power k8s-api.svc.zuul.eqiad1.wikimedia.cloud:" [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936) (owner: BryanDavis)
[17:54:16] ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10993075 (cmooney) We may need to hold off on this for now. The requirement for jumbo frames poses a difficulty for the plan as the parent i...
[17:54:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:55:28] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:55:43] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:56:14] SRE, Infrastructure-Foundations, netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10993082 (cmooney)
[17:56:22] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:57:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78890 and previous config saved to /var/cache/conftool/dbconfig/20250710-175730-root.json
[17:59:15] (PS4) Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[18:00:45] (CR) Dzahn: [C:+2] aptrepo: add gitlab package for bookworm [puppet] - https://gerrit.wikimedia.org/r/1167871 (https://phabricator.wikimedia.org/T384595) (owner: Jelto)
[18:00:46] (PS3) Daimona Eaytoy: Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549)
[18:03:31] (PS4) Daimona Eaytoy: Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549)
[18:05:51] (CR) Daimona Eaytoy: "I made Iab79188f72664247d for another
setting that can be migrated. I missed fishbowl wikis when originally writing that, which is why I g" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[18:15:52] (PS3) Cwhite: logstash: use filter_on_templates_v2 [puppet] - https://gerrit.wikimedia.org/r/1164526 (https://phabricator.wikimedia.org/T234565)
[18:18:54] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[18:19:28] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:26:15] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10993155 (Jhancock.wm)
[18:28:07] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:28:23] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[18:36:32] (PS10) Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[18:36:32] (PS3) Jforrester: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - https://gerrit.wikimedia.org/r/1167910 (owner: Daimona Eaytoy)
[18:37:02] (CR) Jforrester: "PS10: Manual rebase. Let's land this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137480 (owner: BryanDavis)
[18:38:14] (CR) Daimona Eaytoy: "Thank you. Config diff LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/1167910 (owner: Daimona Eaytoy)
[18:38:31] (PS1) Andrew Bogott: cloudcephosd1035: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167914 (https://phabricator.wikimedia.org/T396651)
[18:38:33] (PS1) Andrew Bogott: cloudcephosd1036: update nic names for Bookworm.
[puppet] - https://gerrit.wikimedia.org/r/1167915 (https://phabricator.wikimedia.org/T396651)
[18:38:38] (PS1) Andrew Bogott: cloudcephosd1037: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167916 (https://phabricator.wikimedia.org/T396651)
[18:38:41] (PS1) Andrew Bogott: cloudcephosd1038: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167917 (https://phabricator.wikimedia.org/T396651)
[18:38:43] (PS1) Andrew Bogott: cloudcephosd1039: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167918 (https://phabricator.wikimedia.org/T396651)
[18:38:44] (PS1) Andrew Bogott: cloudcephosd1040: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167919 (https://phabricator.wikimedia.org/T396651)
[18:38:46] (PS1) Andrew Bogott: cloudcephosd1041: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167920 (https://phabricator.wikimedia.org/T396651)
[18:39:26] !log sukhe@cp5017:~$ sudo systemctl stop trafficserver.service && sudo traffic_server -C clear_cache && sudo systemctl start trafficserver.service
[18:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:30] !log sukhe@cp5017:~$ sudo systemctl stop trafficserver.service && sudo traffic_server -C clear_cache && sudo systemctl start trafficserver.service: T399221
[18:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:34] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[18:39:47] !log clearing varnish and ATS cache on cp5017 before repooling eqsin: T399221
[18:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:06] (CR) Andrew Bogott: [C:+2] cloudcephosd1035: update nic names for Bookworm.
[puppet] - https://gerrit.wikimedia.org/r/1167914 (https://phabricator.wikimedia.org/T396651) (owner: Andrew Bogott)
[18:42:31] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:42:45] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:06] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:18] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:46] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:58] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:18] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:29] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:45] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:53] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:47:09] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: arelion drained; traffic is going through ulsfo to codfw, T399221]
[18:47:13] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[18:47:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: arelion drained; traffic is going through
ulsfo to codfw, T399221] [18:48:58] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:49:08] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:49:58] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:50:08] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:50:17] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:50:45] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:50:54] !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:55:21] (03PS6) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) [18:55:52] (03CR) 10Dzahn: "amended to change "passive host" to "replica host"" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:58:35] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:58:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [18:58:46] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [18:59:31] !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1048.eqiad.wmnet with 
OS bookworm [19:00:02] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035.eqiad.wmnet'] [19:00:12] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1035.eqiad.wmnet'] [19:00:18] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet'] [19:00:28] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet'] [19:01:19] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [19:01:28] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [19:01:30] (03CR) 10Jforrester: "OK, final(?) review:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [19:02:15] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm [19:02:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10993245 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eq... [19:04:31] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993263 (10Dzahn) Hosts are not virtual, they are physical machines. So the biggest issue with that would be where to get hardware from... 
[19:05:10] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [19:05:21] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [19:07:39] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [19:07:47] !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [19:10:29] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: "SSD firmware fetch from DELL website not yet implemented" - https://phabricator.wikimedia.org/T399234 (10Andrew) 03NEW [19:15:12] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:19:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:22:09] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [19:24:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:28:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcephosd1048.eqiad.wmnet with reason: host reimage [19:46:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bookworm [19:46:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10993386 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.... [19:50:57] (03CR) 10BryanDavis: [V:03+1] "LGTM. I'm still not quite sure I understand why the flags are changing for Beta's votewiki, but we can chase that more if anyone ever find" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [19:53:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [19:55:04] (03PS1) 10Cwhite: logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565) [19:57:28] (03CR) 10CI reject: [V:04-1] logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:59:05] (03PS2) 10Cwhite: logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T2000). [20:00:04] James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:08:50] (03PS3) 10LD: wmf-config/core-Permissions.php: sort keys alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167927 [20:09:51] (03CR) 10LD: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167927 (owner: 10LD) [20:12:00] (03CR) 10Cwhite: [C:03+2] logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:12:44] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@c558ea4]: Artifactct analytics-test [20:12:57] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@c558ea4]: Artifactct analytics-test (duration: 00m 13s) [20:13:37] !log aqu@deploy1003 Started deploy [airflow-dags/analytics@c558ea4]: Artifactct analytics / main [20:14:20] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@c558ea4]: Artifactct analytics / main (duration: 00m 43s) [20:17:24] (03CR) 10LD: "JSON key order doesn't affect behavior, so Jenkins may not detect this as a meaningful change, but the keys were reordered alphabetically " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167927 (owner: 10LD) [20:21:54] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:24:43] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:24:54] !log root@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:25:15] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1035.eqiad.wmnet [20:25:16] !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:25:19] !log root@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=97) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:30:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:30:26] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:31:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:35:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:35:26] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:36:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:37:34] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993532 (10Arnoldokoth) @Dzahn Or we could repurpose a spare server (if available)? `miscweb` comes to mind... Or were those VMs? [20:39:26] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1035.eqiad.wmnet [20:39:29] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:40:13] (03PS1) 10Ahmon Dancy: logspam.pl: Avoid consolidation of wrapped error message [puppet] - 10https://gerrit.wikimedia.org/r/1167932 (https://phabricator.wikimedia.org/T399239) [20:40:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993548 (10RobH) [20:40:55] Argh, finally back online. 
[20:41:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [20:41:57] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:42:10] (03Merged) 10jenkins-bot: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [20:42:17] thanks for keeping that patch alive James_F :) [20:42:21] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1137480|Use `sul` dblist in InitialiseSettings]] [20:42:25] bd808: Thank you for working on it! [20:42:27] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet [20:42:46] Testing it on debug will be fun. Which of ~2000 settings on ~1000 wikis still work? [20:43:27] #someday we will have "user journey tests". #someday [20:43:51] * James_F has a bridge to sell you. [20:44:11] TBF, for Wikifunctions we do indeed have our Critical User Journeys with matching browser tests for each. [20:44:17] So it is possible. :-) [20:44:22] !log jforrester@deploy1003 jforrester, bd808: Backport for [[gerrit:1137480|Use `sul` dblist in InitialiseSettings]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:44:26] (03Abandoned) 10Ahmon Dancy: logspam: Consolidate several more persistent log messages [puppet] - 10https://gerrit.wikimedia.org/r/1056232 (owner: 10Ahmon Dancy) [20:44:33] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bookworm [20:44:39] step 1: get in in the APP. step 2: ???. step 3: PROFIT! 
[20:44:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1035.eqia... [20:48:37] OK, let's do it. [20:48:39] !log jforrester@deploy1003 jforrester, bd808: Continuing with sync [20:54:04] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137480|Use `sul` dblist in InitialiseSettings]] (duration: 11m 43s) [20:55:00] enwiki still shows the main_page. things must be fine! :) [20:55:23] WCPGW?! [20:55:25] Yeah. [20:56:08] (03CR) 10Jforrester: "Of course! I didn't want to deploy this alongside the parent, but I think this is now good to land." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy) [20:57:12] (03CR) 10Brennen Bearnes: [C:03+1] "Tested on mwlog1002; LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1167932 (https://phabricator.wikimedia.org/T399239) (owner: 10Ahmon Dancy) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T2100) [21:06:58] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [21:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:12:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [21:22:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993697 (10Dzahn) 
@Arnoldokoth If there is a spare server, sure, but I am not sure there is one. Back in the days dcops had a spare poo... [21:26:44] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993701 (10Dzahn) Well... or we could create a VM and try to install VRTS with postfix on that. If that works (where I'm not sure how t... [21:30:39] (03PS5) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) [21:30:52] (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [21:31:11] (03PS5) 10Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: 10Jforrester) [21:31:48] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: "SSD firmware fetch from DELL website not yet implemented" - https://phabricator.wikimedia.org/T399234#10993706 (10RobH) 05Open→03Resolved a:03RobH IRC Update: The file it was looking for didn't exist on the cumin1003 host, but does on cumin20... [21:32:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bookworm [21:32:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1035.eqiad.wm... 
[21:36:47] (03PS6) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) [21:47:09] (03PS7) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) [21:47:38] (03CR) 10Dzahn: [C:03+1] "nice:)" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar) [21:47:57] (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [21:49:01] (03Abandoned) 10Andrew Bogott: cloudcephosd1035: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167914 (https://phabricator.wikimedia.org/T396651) (owner: 10Andrew Bogott) [21:49:30] (03PS4) 10Cwhite: logstash: use filter_on_templates_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1164526 (https://phabricator.wikimedia.org/T234565) [21:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:55:49] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1036.eqiad.wmnet [21:55:49] (03CR) 10Cwhite: [C:03+2] logstash: use filter_on_templates_v2 [puppet] - 
10https://gerrit.wikimedia.org/r/1164526 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:59:31] (03PS1) 10Daimona Eaytoy: Add phan and use it to detect duplicated array keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 [22:00:02] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1036.eqiad.wmnet [22:00:18] (03CR) 10CI reject: [V:04-1] Add phan and use it to detect duplicated array keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [22:05:39] (03PS8) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) [22:06:30] (03CR) 10Daimona Eaytoy: "Config diff review:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [22:06:34] (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [22:10:42] (03PS1) 10Cwhite: logstash: remove filter_on_templates v1 [puppet] - 10https://gerrit.wikimedia.org/r/1167942 (https://phabricator.wikimedia.org/T234565) [22:10:44] (03PS1) 10Cwhite: logstash: rename filter-on-templates.rb [puppet] - 10https://gerrit.wikimedia.org/r/1167943 (https://phabricator.wikimedia.org/T234565) [22:13:06] (03PS1) 10Zabe: Fix categorylinks read new query for excluded categories [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167944 (https://phabricator.wikimedia.org/T385890) [22:13:35] 
!log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1036.eqiad.wmnet [22:13:39] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1036.eqiad.wmnet [22:16:39] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bookworm [22:16:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993757 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1036.eqia... [22:20:57] jouncebot: nowandnext [22:20:57] No deployments scheduled for the next 7 hour(s) and 39 minute(s) [22:20:57] In 7 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0600) [22:21:01] (03CR) 10Zabe: [C:03+2] Fix categorylinks read new query for excluded categories [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167944 (https://phabricator.wikimedia.org/T385890) (owner: 10Zabe) [22:21:55] (03Merged) 10jenkins-bot: Fix categorylinks read new query for excluded categories [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167944 (https://phabricator.wikimedia.org/T385890) (owner: 10Zabe) [22:22:57] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167944|Fix categorylinks read new query for excluded categories (T385890)]] [22:23:02] T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890 [22:24:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167944|Fix categorylinks read new query for excluded categories (T385890)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:25:39] !log zabe@deploy1003 zabe: Continuing with sync [22:30:56] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167944|Fix categorylinks read new query for excluded categories (T385890)]] (duration: 07m 59s) [22:31:00] T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890 [22:39:27] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage [22:42:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:43:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage [22:52:37] (03Abandoned) 10Andrew Bogott: cloudcephosd1036: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167915 (https://phabricator.wikimedia.org/T396651) (owner: 10Andrew Bogott) [23:02:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bookworm [23:03:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1036.eqiad.wm... 
[23:03:17] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1037.eqiad.wmnet [23:08:59] andrew@cumin2002 upgrade-firmware (PID 535018) is awaiting input [23:09:06] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#10993801 (10Dzahn) To my surprise it seems like profile::httpd is only included in role::config_master anymore but that's it. [23:13:53] (03PS1) 10Dzahn: profile::httpd: include prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) [23:15:12] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:21:50] (03CR) 10Dzahn: [V:03+1] "it is used on far fewer roles anymore than it used be. seems like in prod it's just puppetserver and config-master, where "just" is relati" [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: 10Dzahn) [23:38:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167966 [23:38:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167966 (owner: 10TrainBranchBot) [23:40:35] 06SRE: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10993827 (10Dzahn) Should we just talk to Marko directly and ask if he uses this? Then it becomes clear if a new NDA should be created or just access removed. 
https://www.linkedin.com/in/doorman [23:44:01] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1037.eqiad.wmnet [23:49:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:50:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167966 (owner: 10TrainBranchBot) [23:57:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1037.eqiad.wmnet [23:57:56] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1037.eqiad.wmnet [23:59:04] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bookworm [23:59:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037.eqia...