[00:00:03] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313 (owner: TrainBranchBot)
[00:08:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167714
[00:08:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167714 (owner: TrainBranchBot)
[00:11:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:12:09] <wikibugs>	 (PS1) Ssingh: team-traffic: dnsbox: alert after rule is true for 1m [alerts] - https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619)
[00:13:19] <wikibugs>	 (CR) CI reject: [V:-1] team-traffic: dnsbox: alert after rule is true for 1m [alerts] - https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[00:22:41] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1165832/6229/" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:22:51] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] gerrit: config replicas for rename-project plugin [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:32:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167714 (owner: TrainBranchBot)
[00:32:32] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "all 3 servers have the new firewall rule and the gerrit config change. I did a service restart on gerrit2002 (to verify there is no syntax" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:35:18] <wikibugs>	 (PS5) Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833)
[00:36:37] <wikibugs>	 (CR) Dzahn: "changed "standby" to "spare" host to address concerns about confusing naming" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[00:36:59] <wikibugs>	 (CR) Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[00:39:45] <wikibugs>	 (CR) Dzahn: "the linked task is currently stalled and we have agreed to only do this once we have a real decision there. so this code change is also st" [dns] - https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: Dzahn)
[00:41:22] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:41:52] <wikibugs>	 (Abandoned) Dzahn: gerrit: replace hardcoded host name and codfw string for replica [puppet] - https://gerrit.wikimedia.org/r/1129919 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[00:42:16] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:40] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[00:45:41] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "If we agree on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129920 and once gerrit2002 is down.. I would then make a new patch to" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:48:06] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:12] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:50:24] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "same for the replica settings. once we drop gerrit2002 and only gerrit2003 is left we can replace the host name string with the replica_ho" [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[00:57:36] <icinga-wm>	 RECOVERY - Disk space on an-worker1082 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1082&var-datasource=eqiad+prometheus/ops
[01:39:27] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[01:53:00] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[01:53:30] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.dhcp for host cloudcephosd1006.eqiad.wmnet
[01:55:03] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudcephosd1006.eqiad.wmnet
[02:03:50] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:03:56] <logmsgbot>	 !log root@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:03:59] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:04:03] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:04:27] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:11:18] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:28:51] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:29:07] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:29:43] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:37:27] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:39:23] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:46:47] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[02:53:22] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:01:35] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:01:50] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:15:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:17:49] <logmsgbot>	 andrew@cumin1003 reimage (PID 1068697) is awaiting input
[03:30:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[03:35:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[03:36:49] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:37:12] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[03:50:44] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990315 (VRiley-WMF) @Marostegui I have carved up the RAID into a RAID 10 and reimaged these servers. Would you be able to check it to see if it works out for you?
[03:55:25] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage
[03:58:40] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage
[04:10:03] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[04:11:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:51] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[04:13:10] <logmsgbot>	 !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[04:14:17] <logmsgbot>	 !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[04:16:19] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[04:16:28] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bookworm
[04:17:20] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[04:18:01] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[04:29:11] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[04:32:45] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[04:32:56] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[04:46:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:51:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:55:56] <logmsgbot>	 !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[04:56:42] <logmsgbot>	 !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[04:56:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:57:49] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[04:58:36] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[05:01:44] <jinxer-wm>	 RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:11:55] <wikibugs>	 (PS1) Giuseppe Lavagetto: Bugfixes for dependents, rename [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167732
[05:17:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfixes for dependents, rename [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167732 (owner: Giuseppe Lavagetto)
[05:21:10] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:21:42] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1003"
[05:21:44] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1003
[05:22:18] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1003
[05:22:19] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1003"
[05:25:20] <wikibugs>	 (CR) KartikMistry: [C:+2] machinetranslation: staging: Update MinT to 2025-07-09-124154-production [deployment-charts] - https://gerrit.wikimedia.org/r/1167608 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[05:26:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:27:02] <wikibugs>	 (Merged) jenkins-bot: machinetranslation: staging: Update MinT to 2025-07-09-124154-production [deployment-charts] - https://gerrit.wikimedia.org/r/1167608 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[05:31:24] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 252.08 ms
[05:31:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:38:08] <kart_>	 Quick deploy of MinT on staging..
[05:38:23] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:54:19] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0600)
[06:09:48] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2001.codfw.wmnet, wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:10:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:12:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10990367 (elukey) @Jclark-ctr IIUC it was a temporary failure right?
[06:14:48] <wikibugs>	 SRE-swift-storage, MinT, LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10990368 (KartikMistry) Status update:  We're testing the `entrypoint.sh` in the staging (using `values-staging.yaml`). Currentl...
[06:21:14] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990381 (Marostegui) Looking good on both hosts @VRiley-WMF! Thank you so much! `  VD LIST : =======  --------------------------------------------------------------- DG/VD TYPE...
[06:22:02] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990382 (Marostegui) Open→Resolved
[06:25:57] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[06:28:37] <wikibugs>	 (PS1) Muehlenhoff: Record LDAP access of vpm [puppet] - https://gerrit.wikimedia.org/r/1167735
[06:30:02] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record LDAP access of vpm [puppet] - https://gerrit.wikimedia.org/r/1167735 (owner: Muehlenhoff)
[06:35:31] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2228.codfw.wmnet with reason: Maintenance
[06:35:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2228 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78846 and previous config saved to /var/cache/conftool/dbconfig/20250710-063535-marostegui.json
[06:36:15] <wikibugs>	 (PS1) Marostegui: db2228: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167736 (https://phabricator.wikimedia.org/T398928)
[06:36:17] <wikibugs>	 (PS3) Muehlenhoff: memcached::instance: Remove support for Ferm-only syntax [puppet] - https://gerrit.wikimedia.org/r/1161511
[06:36:44] <wikibugs>	 (CR) Marostegui: [C:+2] db2228: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167736 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:37:06] <wikibugs>	 (CR) Arnaudb: [C:+1] "this is one of those changes where the context is way longer than the change 😄 thanks @dzahn@wikimedia.org it looks good to me" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[06:39:05] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2228.codfw.wmnet with reason: Maintenance
[06:43:38] <wikibugs>	 (CR) Muehlenhoff: [C:+2] thumbor: Update service image to latest rebuild [deployment-charts] - https://gerrit.wikimedia.org/r/1167240 (owner: Muehlenhoff)
[06:44:11] <wikibugs>	 (PS1) Marostegui: db1210: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167737 (https://phabricator.wikimedia.org/T398928)
[06:44:38] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[06:44:46] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[06:44:55] <wikibugs>	 (CR) Marostegui: [C:+2] db1210: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167737 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:45:55] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[06:45:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1210 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78847 and previous config saved to /var/cache/conftool/dbconfig/20250710-064558-marostegui.json
[06:46:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78848 and previous config saved to /var/cache/conftool/dbconfig/20250710-064605-root.json
[06:47:08] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[06:49:10] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[06:50:30] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161511 (owner: Muehlenhoff)
[06:52:08] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and A:cp - 2.8.15 upgrade (T398720)
[06:52:11] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[06:52:38] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[06:53:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78849 and previous config saved to /var/cache/conftool/dbconfig/20250710-065350-root.json
[06:55:37] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[06:58:30] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[06:58:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[06:59:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:59:26] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and A:cp - 2.8.15 upgrade (T398720)
[06:59:29] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0700)
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:39] <moritzm>	 !log installing libbpf security updates
[07:00:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78851 and previous config saved to /var/cache/conftool/dbconfig/20250710-070111-root.json
[07:03:40] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for libbpf [puppet] - https://gerrit.wikimedia.org/r/1167738
[07:06:21] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hint for libbpf [puppet] - https://gerrit.wikimedia.org/r/1167738 (owner: Muehlenhoff)
[07:08:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78852 and previous config saved to /var/cache/conftool/dbconfig/20250710-070855-root.json
[07:10:28] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[07:15:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:16:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78853 and previous config saved to /var/cache/conftool/dbconfig/20250710-071616-root.json
[07:17:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152 (Addshore) NEW
[07:17:55] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[07:18:18] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Maintenance
[07:21:21] <marostegui>	 haproxy alerts will be expected
[07:21:57] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1213 to m1 [puppet] - https://gerrit.wikimedia.org/r/1167741 (https://phabricator.wikimedia.org/T399060)
[07:22:03] <wikibugs>	 (PS1) Elukey: machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491)
[07:22:26] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:22:28] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:23:23] <wikibugs>	 (CR) CI reject: [V:-1] machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491) (owner: Elukey)
[07:23:27] <wikibugs>	 (PS2) Marostegui: mariadb: Move db1213 to m1 [puppet] - https://gerrit.wikimedia.org/r/1167741 (https://phabricator.wikimedia.org/T399060)
[07:24:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78855 and previous config saved to /var/cache/conftool/dbconfig/20250710-072401-root.json
[07:25:03] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Move db1213 to m1 [puppet] - https://gerrit.wikimedia.org/r/1167741 (https://phabricator.wikimedia.org/T399060) (owner: Marostegui)
[07:25:48] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:28:15] <wikibugs>	 (PS2) Elukey: machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491)
[07:29:09] <hashar>	 !log Restarting CI Jenkins
[07:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2228 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78856 and previous config saved to /var/cache/conftool/dbconfig/20250710-073123-root.json
[07:36:26] <wikibugs>	 (CR) Elukey: [C:+2] machinetranslation: add snippet to fetch private env variables [deployment-charts] - https://gerrit.wikimedia.org/r/1167742 (https://phabricator.wikimedia.org/T335491) (owner: Elukey)
[07:39:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78857 and previous config saved to /var/cache/conftool/dbconfig/20250710-073907-root.json
[07:39:46] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and A:cp - 2.8.15 upgrade (T398720)
[07:39:49] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[07:39:51] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: disable OCSP for GTS certs [puppet] - https://gerrit.wikimedia.org/r/1167687 (https://phabricator.wikimedia.org/T399079) (owner: Ssingh)
[07:39:57] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-d1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-d1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:40:40] <wikibugs>	 (CR) Vgutierrez: [C:+1] P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[07:41:47] <wikibugs>	 (CR) Vgutierrez: [C:+1] nagios_common: remove check_ssl_cdn_ocsp* [puppet] - https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[07:43:03] <wikibugs>	 (PS1) Marostegui: db2178: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167743 (https://phabricator.wikimedia.org/T398928)
[07:43:12] <wikibugs>	 (CR) Vgutierrez: [C:+1] "brett, FYI icinga checks get applied on alert hosts, so PCC would need to include alert1002.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: Ssingh)
[07:43:37] <wikibugs>	 (CR) Marostegui: [C:+2] db2178: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167743 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[07:44:25] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and A:cp - 2.8.15 upgrade (T398720)
[07:44:29] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance
[07:44:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2178 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78858 and previous config saved to /var/cache/conftool/dbconfig/20250710-074432-marostegui.json
[07:44:54] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: sync
[07:45:40] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Switch to upload cert on upload cluster [puppet] - https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[07:47:28] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync
[07:50:59] <vgutierrez>	 !log switching to upload cert globally on upload CDN cluster - T394484
[07:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:02] <stashbot>	 T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484
[07:52:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78859 and previous config saved to /var/cache/conftool/dbconfig/20250710-075202-root.json
[07:54:52] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1047.eqiad.wmnet with reason: Maintenance
[07:55:09] <wikibugs>	 (CR) Slyngshede: [C:+2] Netbox: add limit to rate [alerts] - https://gerrit.wikimedia.org/r/1167633 (owner: Slyngshede)
[07:55:11] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[07:57:24] <wikibugs>	 (Merged) jenkins-bot: Netbox: add limit to rate [alerts] - https://gerrit.wikimedia.org/r/1167633 (owner: Slyngshede)
[07:59:42] <wikibugs>	 (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[07:59:59] <wikibugs>	 (CR) David Caro: Cloudcephosd1048: Configure ceph with a single nic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[08:00:05] <jouncebot>	 andre and jnuche: That opportune time for a MediaWiki train - Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0800).
[08:00:45] <moritzm>	 !log installing python-urllib3 security updates
[08:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:28] <andre>	 o/
[08:02:47] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167819 (https://phabricator.wikimedia.org/T392179)
[08:02:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167819 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:03:28] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:41] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167819 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:05:52] <klausman>	 !log Depooling Liftwing prod in codfw so we can roll out some changes that restart all services (cf. T398533)
[08:05:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:56] <stashbot>	 T398533: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533
[08:06:52] <wikibugs>	 (CR) David Caro: "The pcc LGTM, just a note on the datastructure there" [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[08:07:00] <logmsgbot>	 !log klausman@cumin1002 conftool action : get/pooled; selector: dnsdisc=inference,name=codfw
[08:07:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78860 and previous config saved to /var/cache/conftool/dbconfig/20250710-080708-root.json
[08:07:41] <logmsgbot>	 !log klausman@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=inference,name=codfw
[08:09:47] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158 (LSobanski) NEW
[08:10:14] <moritzm>	 !log installing containerd security updates
[08:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:17] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: ProbeDown (instance data-gateway-staging:30443) - https://phabricator.wikimedia.org/T399159 (LSobanski) NEW
[08:10:28] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: ProbeDown (instance data-gateway-staging:30443) - https://phabricator.wikimedia.org/T399159#10990844 (LSobanski) The alert is firing for both eqiad and codfw.
[08:10:59] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T399160 (LSobanski) NEW
[08:11:28] <logmsgbot>	 !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.9  refs T392179
[08:11:32] <stashbot>	 T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179
[08:11:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:13] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 2.8.15 upgrade (T398720)
[08:12:16] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[08:15:07] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 2.8.15 upgrade (T398720)
[08:16:09] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161 (LSobanski) NEW
[08:21:58] <wikibugs>	 (CR) Muehlenhoff: [C:+2] memcached::instance: Remove support for Ferm-only syntax [puppet] - https://gerrit.wikimedia.org/r/1161511 (owner: Muehlenhoff)
[08:22:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78861 and previous config saved to /var/cache/conftool/dbconfig/20250710-082213-root.json
[08:26:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] httpbb: Rebuild for Bookworm [software/httpbb] - https://gerrit.wikimedia.org/r/1146585 (https://phabricator.wikimedia.org/T393711) (owner: Muehlenhoff)
[08:27:12] <wikibugs>	 (PS2) Muehlenhoff: Enable profile::auto_restarts::service for hiddenparma [puppet] - https://gerrit.wikimedia.org/r/1092195 (https://phabricator.wikimedia.org/T135991)
[08:30:04] <wikibugs>	 (PS2) Muehlenhoff: mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1037766
[08:30:08] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[08:30:28] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037766 (owner: Muehlenhoff)
[08:31:18] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[08:33:15] <wikibugs>	 (PS5) Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1098556
[08:37:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78863 and previous config saved to /var/cache/conftool/dbconfig/20250710-083719-root.json
[08:40:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:40:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:41:02] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1098556 (owner: Muehlenhoff)
[08:41:31] <wikibugs>	 (CR) Elukey: "Left some nits and high level comments, the work is really great and the new functionality is what we need. I didn't test the command but " [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[08:45:18] <moritzm>	 !log installing setuptools security updates
[08:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:58] <wikibugs>	 SRE, collaboration-services, Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10990931 (ABran-WMF) a:ABran-WMF
[08:51:17] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 2.8.15 upgrade (T398720)
[08:51:22] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[08:53:05] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 2.8.15 upgrade (T398720)
[08:59:29] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you and see inline" [puppet] - https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) (owner: Jcrespo)
[09:02:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:02:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:03:17] <wikibugs>	 (PS1) Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696)
[09:12:00] <wikibugs>	 (PS2) Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696)
[09:12:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Update db2240 T397163', diff saved to https://phabricator.wikimedia.org/P78865 and previous config saved to /var/cache/conftool/dbconfig/20250710-091250-fceratto.json
[09:12:55] <stashbot>	 T397163: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T397163
[09:13:12] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6231/co" [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[09:14:07] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2161 gradually with 4 steps - Pooling in
[09:14:11] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2161 gradually with 4 steps - Pooling in
[09:14:47] <wikibugs>	 (PS1) Klausman: httpbb: Add missing machinery to deplot article-models test file [puppet] - https://gerrit.wikimedia.org/r/1167827
[09:14:57] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:15:06] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2240 gradually with 4 steps - Pooling in
[09:15:10] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2240 gradually with 4 steps - Pooling in
[09:16:24] <wikibugs>	 (CR) AikoChou: [C:+1] httpbb: Add missing machinery to deplot article-models test file [puppet] - https://gerrit.wikimedia.org/r/1167827 (owner: Klausman)
[09:19:55] <wikibugs>	 (CR) Volans: "It's removing the `--filter-file /etc/docker-report/k8s_registry_rules.ini` from the registry one, expected?" [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[09:19:57] <jinxer-wm>	 FIRING: [3x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:21:47] <wikibugs>	 (PS2) Klausman: httpbb: Add missing machinery to deploy some tests [puppet] - https://gerrit.wikimedia.org/r/1167827
[09:24:00] <wikibugs>	 (CR) AikoChou: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1167827 (owner: Klausman)
[09:24:07] <wikibugs>	 (PS3) Hashar: Use thirdparty/jenkins on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127)
[09:24:08] <wikibugs>	 (CR) Hashar: "Moritz asked for the rename in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137361/comments/e85643f6_01ddab1a" [puppet] - https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: Hashar)
[09:24:11] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] httpbb: Add missing machinery to deploy some tests [puppet] - https://gerrit.wikimedia.org/r/1167827 (owner: Klausman)
[09:27:32] <wikibugs>	 (PS1) Hnowlan: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828
[09:27:43] <wikibugs>	 (CR) Klausman: [V:+1 C:+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6232/co" [puppet] - https://gerrit.wikimedia.org/r/1167827 (owner: Klausman)
[09:28:36] <wikibugs>	 (CR) Jgiannelos: "Needs version bump" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828 (owner: Hnowlan)
[09:28:59] <wikibugs>	 (PS2) Hnowlan: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828
[09:31:26] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:58] <claime>	 dd
[09:32:10] <claime>	 >_>
[09:32:35] <wikibugs>	 (PS3) Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests [puppet] - https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779)
[09:34:57] <wikibugs>	 (CR) Klausman: httpbb(liftwing): add edit-check tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: Ilias Sarantopoulos)
[09:35:44] <wikibugs>	 (PS3) Hnowlan: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828
[09:36:21] <wikibugs>	 (CR) Jgiannelos: [C:+1] Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828 (owner: Hnowlan)
[09:36:58] <wikibugs>	 (CR) Clément Goubert: [C:+2] "I reset the failed service, but it's not even supposed to try to start..." [puppet] - https://gerrit.wikimedia.org/r/1166213 (owner: Clément Goubert)
[09:39:05] <wikibugs>	 (CR) Hnowlan: [C:+2] Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828 (owner: Hnowlan)
[09:40:58] <wikibugs>	 (Merged) jenkins-bot: Revert^2 "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167828 (owner: Hnowlan)
[09:43:22] <moritzm>	 !log installing initramfs-tools bugfix updates from Bookworm point release
[09:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[09:44:32] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[09:45:38] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[09:45:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[09:56:29] <wikibugs>	 (PS1) Jgiannelos: changeprop: Fix file exclusion rule regex [deployment-charts] - https://gerrit.wikimedia.org/r/1167830
[09:56:30] <wikibugs>	 (CR) Jcrespo: "Some comments about what to do next." [puppet] - https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) (owner: Jcrespo)
[09:56:59] <wikibugs>	 (PS4) Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests [puppet] - https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779)
[09:57:07] <wikibugs>	 (CR) Hnowlan: [C:+1] changeprop: Fix file exclusion rule regex [deployment-charts] - https://gerrit.wikimedia.org/r/1167830 (owner: Jgiannelos)
[09:57:55] <wikibugs>	 (CR) Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: Ilias Sarantopoulos)
[10:00:00] <wikibugs>	 (CR) Jgiannelos: [C:+2] changeprop: Fix file exclusion rule regex [deployment-charts] - https://gerrit.wikimedia.org/r/1167830 (owner: Jgiannelos)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1000)
[10:01:02] <wikibugs>	 ops-codfw, SRE, DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399097#10991127 (cmooney) So yeah this continued to bounce after that yesterday, eventually going hard down and remains so. ` Jul...
[10:01:37] <wikibugs>	 (Merged) jenkins-bot: changeprop: Fix file exclusion rule regex [deployment-charts] - https://gerrit.wikimedia.org/r/1167830 (owner: Jgiannelos)
[10:02:04] <wikibugs>	 (PS6) Jcrespo: prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T350360)
[10:04:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#10991132 (Ladsgroup) As WMF sponsor. This request has my support. I don't know what the policy is these days but if it needs a staff sponsor, it has mine
[10:04:53] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:05:02] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:05:08] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:05:19] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:05:23] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[10:05:37] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[10:05:42] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:05:48] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:15:30] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10991171 (MoritzMuehlenhoff)
[10:15:56] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10991173 (MoritzMuehlenhoff)
[10:24:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[10:24:44] <jinxer-wm>	 Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[10:33:16] <elukey>	 !log kafka preferred-replica-election on kafka-main2010
[10:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:57] <jinxer-wm>	 FIRING: [5x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:39:57] <jinxer-wm>	 FIRING: [7x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:44:57] <jinxer-wm>	 FIRING: [9x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:45:10] <logmsgbot>	 jelto@cumin1003 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:46:23] <wikibugs>	 (PS1) Marostegui: mariadb: Change backups host [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172)
[10:46:47] <wikibugs>	 (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[10:48:10] <logmsgbot>	 jelto@cumin1003 upgrade (PID 1090856) is awaiting input
[10:49:57] <jinxer-wm>	 FIRING: [11x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:50:39] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3646 MB (3% inode=98%): /tmp 3646 MB (3% inode=98%): /var/tmp 3646 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[10:54:57] <jinxer-wm>	 FIRING: [13x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:55:52] <wikibugs>	 (CR) Jcrespo: [C:+1] "Please let me know when the switchover happens (can be after the fact, don't depend on me) to make sure service is ok after it." [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[10:56:25] <wikibugs>	 (CR) Marostegui: [C:-2] "Will do, aiming for Monday morning" [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[10:57:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:58:32] <wikibugs>	 (PS1) Jgiannelos: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - https://gerrit.wikimedia.org/r/1167834
[10:58:50] <wikibugs>	 (PS2) Jgiannelos: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - https://gerrit.wikimedia.org/r/1167834
[10:58:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:59:57] <jinxer-wm>	 FIRING: [15x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:03:26] <wikibugs>	 (PS3) Jgiannelos: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750)
[11:04:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1039', diff saved to https://phabricator.wikimedia.org/P78867 and previous config saved to /var/cache/conftool/dbconfig/20250710-110408-marostegui.json
[11:04:15] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1039.eqiad.wmnet with reason: Maintenance
[11:04:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[11:04:44] <jinxer-wm>	 Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[11:04:57] <jinxer-wm>	 FIRING: [17x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:06:11] <wikibugs>	 (CR) Clément Goubert: [C:+1] hcaptcha: initial commit for proxy config [puppet] - https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: Kamila Součková)
[11:06:36] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[11:06:46] <wikibugs>	 (CR) Clément Goubert: [C:+1] wikimedia: add CNAMEs for hcaptcha domains [dns] - https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[11:06:58] <wikibugs>	 (CR) Hnowlan: [C:+1] changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) (owner: Jgiannelos)
[11:09:27] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 2.8.15 upgrade (T398720)
[11:09:31] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[11:09:33] <wikibugs>	 (CR) Klausman: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6233/co" [puppet] - https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: Ilias Sarantopoulos)
[11:09:38] <wikibugs>	 (CR) Clément Goubert: [C:+1] trafficserver, cache: add config for edge routing of hcaptcha [puppet] - https://gerrit.wikimedia.org/r/1167670 (https://phabricator.wikimedia.org/T397841) (owner: Hnowlan)
[11:09:57] <jinxer-wm>	 FIRING: [19x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:09:59] <wikibugs>	 (CR) Klausman: [V:+1 C:+2] httpbb(liftwing): add edit-check tests [puppet] - https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: Ilias Sarantopoulos)
[11:14:13] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 2.8.15 upgrade (T398720)
[11:14:57] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:15:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:16:51] <wikibugs>	 (PS3) Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696)
[11:18:05] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6234/co" [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[11:19:38] <wikibugs>	 sre-alert-triage, Maps, serviceops: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#10991406 (Clement_Goubert) This is on a `maps-test` server, maybe alert severity should be brought down. Anyhow, tagging #maps project for follo...
[11:21:25] <wikibugs>	 (CR) Elukey: "Should be fixed!" [puppet] - https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[11:21:51] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1007.eqiad.wmnet']
[11:24:04] <wikibugs>	 (PS1) Klausman: httpbb: drop extraneous `files/` path element [puppet] - https://gerrit.wikimedia.org/r/1167836
[11:24:18] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] httpbb: drop extraneous `files/` path element [puppet] - https://gerrit.wikimedia.org/r/1167836 (owner: Klausman)
[11:25:06] <wikibugs>	 (CR) Jgiannelos: [C:+2] changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) (owner: Jgiannelos)
[11:27:19] <wikibugs>	 (Merged) jenkins-bot: changeprop: Simplify pcs rules, use purge instead of pregen [deployment-charts] - https://gerrit.wikimedia.org/r/1167834 (https://phabricator.wikimedia.org/T397750) (owner: Jgiannelos)
[11:29:40] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1007.eqiad.wmnet']
[11:30:36] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[11:30:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180 (cmooney) NEW p:Triage→Medium
[11:30:43] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bookworm
[11:30:48] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[11:31:44] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#10991485 (cmooney) Stalled→Resolved a:cmooney I am going to close this one (please ping me if that is hasty!) as I've o...
[11:33:17] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart
[11:33:55] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991500 (cmooney)
[11:34:44] <wikibugs>	 SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10991503 (Tobi_WMDE_SW) >>! In T398686#10985976, @Dzahn wrote: > @Tobi_WMDE_SW and/or @sowmya.guru, is this request only to add a new approver or is it _also_ for access for...
[11:34:58] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[11:35:12] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[11:35:17] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[11:35:33] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[11:39:19] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0)
[11:41:18] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1007.eqiad.wmnet with OS bookworm
[11:44:28] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bookworm
[11:46:09] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis mediawikiwiki, testwiki in section s5
[11:46:36] <wikibugs>	 (03PS1) 10Marostegui: db1200: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167837 (https://phabricator.wikimedia.org/T398928)
[11:47:12] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1200: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167837 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui)
[11:47:36] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[11:47:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1200 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78869 and previous config saved to /var/cache/conftool/dbconfig/20250710-114739-marostegui.json
[11:48:59] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: enable gerrit.service and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804)
[11:49:43] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1007.eqiad.wmnet with OS bookworm
[11:50:00] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: enable gerrit.service and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804)
[11:51:31] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis mediawikiwiki, testwiki in section s5
[11:51:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[11:52:02] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis mediawikiwiki, testwiki in section s5
[11:52:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10991577 (10cmooney) I created the below task to continue the discussion of how we set up the interfaces for these hosts, and cop...
[11:53:16] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bookworm
[11:53:34] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 2.8.15 upgrade (T398720)
[11:53:37] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[11:55:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78870 and previous config saved to /var/cache/conftool/dbconfig/20250710-115534-root.json
[11:56:23] <logmsgbot>	 fceratto@cumin1002 sanitize-wiki (PID 1181071) is awaiting input
[11:56:59] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 2.8.15 upgrade (T398720)
[11:59:59] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 2.8.15 upgrade (T398720)
[12:00:02] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1200)
[12:00:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991587 (10cmooney)
[12:01:47] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 2.8.15 upgrade (T398720)
[12:01:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991590 (10cmooney)
[12:02:44] <logmsgbot>	 fceratto@cumin1002 sanitize-wiki (PID 1181071) is awaiting input
[12:06:18] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou)
[12:06:18] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis mediawikiwiki, testwiki in section s5
[12:07:51] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou)
[12:10:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78871 and previous config saved to /var/cache/conftool/dbconfig/20250710-121039-root.json
[12:11:24] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage
[12:14:02] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage
[12:15:48] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis mediawikiwiki, testwiki in section s3
[12:17:15] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:25] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:56] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=97) Managing sanitization for wikis mediawikiwiki, testwiki in section s3
[12:18:05] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:21:25] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:22:15] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:24:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:25:11] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:25:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78872 and previous config saved to /var/cache/conftool/dbconfig/20250710-122545-root.json
[12:27:07] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:27:09] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 4.513 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:27:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.433 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:30:39] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3616 MB (3% inode=98%): /tmp 3616 MB (3% inode=98%): /var/tmp 3616 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[12:32:17] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[12:32:45] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bookworm
[12:35:23] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 2.8.15 upgrade (T398720)
[12:35:27] <stashbot>	 T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[12:35:42] <wikibugs>	 (03PS1) 10Marostegui: db2171: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167852 (https://phabricator.wikimedia.org/T398928)
[12:37:31] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2171: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167852 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui)
[12:38:06] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance
[12:38:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2171 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78873 and previous config saved to /var/cache/conftool/dbconfig/20250710-123809-marostegui.json
[12:39:34] <wikibugs>	 (03PS1) 10KartikMistry: machinetranslation: Use s3 model storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491)
[12:39:51] <wikibugs>	 (03PS4) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696)
[12:39:53] <wikibugs>	 (03CR) 10Elukey: profile::docker::reporter: add wikikube-staging and ml-staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[12:40:10] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 2.8.15 upgrade (T398720)
[12:40:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78874 and previous config saved to /var/cache/conftool/dbconfig/20250710-124051-root.json
[12:42:17] <wikibugs>	 (03PS1) 10Michael Große: fix(StructuredTask): wrong order in resolving a deferred [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856
[12:42:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:44:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 (owner: 10Michael Große)
[12:45:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78875 and previous config saved to /var/cache/conftool/dbconfig/20250710-124530-root.json
[12:48:20] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[12:48:58] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add wikikube-staging and ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/1167824 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[12:49:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar)
[12:52:31] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:52:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:54:27] <wikibugs>	 (03PS1) 10AikoChou: httpbb(liftwing): update edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1167858 (https://phabricator.wikimedia.org/T397013)
[12:54:32] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, a question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[12:57:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Use thirdparty/jenkins on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar)
[12:58:15] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:25] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:35] <wikibugs>	 (03CR) 10Klausman: [C:03+1] httpbb(liftwing): update edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1167858 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou)
[12:59:29] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:00:05] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:00:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:00:28] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300). Please do the needful.
[13:00:28] <jouncebot>	 MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:35] * MichaelG_WMF is here
[13:00:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78876 and previous config saved to /var/cache/conftool/dbconfig/20250710-130036-root.json
[13:04:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991850 (10cmooney)
[13:06:12] <moritzm>	 !log installing ICU security updates
[13:06:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:58] <MichaelG_WMF>	 jouncebot: nowandnext
[13:06:58] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300)
[13:06:58] <jouncebot>	 In 1 hour(s) and 23 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1430)
[13:07:21] <MichaelG_WMF>	 @moritzm do these security updates affect MediaWiki backports?
[13:08:06] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:08:35] <wikibugs>	 (03CR) 10Hashar: "releases2003 (Bookworm) now has `thirdparty/jenkins` in `/etc/apt/sources.list.d/thirdparty-jenkins.sources`." [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar)
[13:10:39] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3533 MB (3% inode=98%): /tmp 3533 MB (3% inode=98%): /var/tmp 3533 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[13:15:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78878 and previous config saved to /var/cache/conftool/dbconfig/20250710-131541-root.json
[13:17:29] <wikibugs>	 (03Abandoned) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro)
[13:18:22] <moritzm>	 MichaelG_WMF: no, these are unrelated to the current mediawiki deployments
[13:18:23] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: enable monitoring for other instances [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854)
[13:18:23] <wikibugs>	 (03CR) 10Arnaudb: "@jwodstrcil@wikimedia.org highlighted a missing scraping from our current config in https://phabricator.wikimedia.org/T398854#10991075 thi" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb)
[13:18:29] <wikibugs>	 (03CR) 10Hashar: "I have updated the reprepro documentation at https://wikitech.wikimedia.org/wiki/Jenkins#Get_the_package :)" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar)
[13:19:14] <MichaelG_WMF>	 @moritzm ack, thanks for confirming!
[13:20:38] <wikibugs>	 (03PS10) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003)
[13:21:26] <wikibugs>	 (03PS2) 10Ssingh: team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619)
[13:26:09] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Point purged@eqsin to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1167860
[13:26:54] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167860 (owner: 10Vgutierrez)
[13:27:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:27:55] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:29:03] <wikibugs>	 (03Merged) 10jenkins-bot: team-traffic: dnsbox: alert after rule is true for 1m [alerts] - 10https://gerrit.wikimedia.org/r/1167716 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[13:29:25] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: disable OCSP for GTS certs [puppet] - 10https://gerrit.wikimedia.org/r/1167687 (https://phabricator.wikimedia.org/T399079) (owner: 10Ssingh)
[13:29:42] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] hiera: Point purged@eqsin to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1167860 (owner: 10Vgutierrez)
[13:29:57] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Point purged@eqsin to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1167860 (owner: 10Vgutierrez)
[13:30:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10991910 (10OKryva-WMF) I am Tran's EM. Approve the request.
[13:30:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78879 and previous config saved to /var/cache/conftool/dbconfig/20250710-133047-root.json
[13:30:48] <wikibugs>	 (03PS1) 10David Caro: toolforge: rename the jobs-cli to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861
[13:33:04] <wikibugs>	 (03PS1) 10Marostegui: db2211: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167862 (https://phabricator.wikimedia.org/T398928)
[13:33:30] <wikibugs>	 (03PS2) 10David Caro: toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861
[13:33:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2211: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167862 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui)
[13:33:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 (owner: 10David Caro)
[13:34:14] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2211.codfw.wmnet with reason: Maintenance
[13:34:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2211 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78880 and previous config saved to /var/cache/conftool/dbconfig/20250710-133418-marostegui.json
[13:34:57] <_joe_>	 jouncebot: now
[13:34:57] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300)
[13:35:35] <wikibugs>	 (03PS3) 10David Caro: toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861
[13:36:14] <hashar>	 _joe_: I will do MichaelG_WMF backport patch
[13:36:24] <MichaelG_WMF>	 hashar: thank you!
[13:36:31] <_joe_>	 hashar: <3
[13:36:47] <hashar>	 or did you have some urgent stuff to do on wikikube?
[13:37:46] <_joe_>	 hashar: no, I was looking at what was in the calendar :)
[13:37:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 (owner: 10Michael Große)
[13:38:09] <hashar>	 _joe_: great :)
[13:38:21] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge: rename the jobs-cli and misctools to the new name [puppet] - 10https://gerrit.wikimedia.org/r/1167861 (owner: 10David Caro)
[13:38:34] <wikibugs>	 (03CR) 10David Caro: [C:03+2] "This is reverting the latest deploys, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1167861 (owner: 10David Caro)
[13:39:50] <wikibugs>	 (03Merged) 10jenkins-bot: fix(StructuredTask): wrong order in resolving a deferred [extensions/GrowthExperiments] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167856 (owner: 10Michael Große)
[13:40:14] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1167856|fix(StructuredTask): wrong order in resolving a deferred]]
[13:40:35] <wikibugs>	 (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[13:41:25] * MichaelG_WMF is ready to test with the debug extension whenever
[13:41:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78881 and previous config saved to /var/cache/conftool/dbconfig/20250710-134150-root.json
[13:42:17] <logmsgbot>	 !log hashar@deploy1003 migr, hashar: Backport for [[gerrit:1167856|fix(StructuredTask): wrong order in resolving a deferred]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:42:25] * MichaelG_WMF looks
[13:42:46] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "hiera: Point purged@eqsin to main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1167864
[13:43:05] <hashar>	 :)
[13:45:03] <MichaelG_WMF>	 hashar: I can confirm that this fixes the regression. Good to roll forward from my side 👍
[13:45:11] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] Revert "hiera: Point purged@eqsin to main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1167864 (owner: 10Vgutierrez)
[13:45:57] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Point purged@eqsin to main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1167864 (owner: 10Vgutierrez)
[13:46:03] <logmsgbot>	 !log hashar@deploy1003 migr, hashar: Continuing with sync
[13:46:07] <hashar>	 MichaelG_WMF: congratulations!
[13:46:09] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS trixie
[13:46:29] <MichaelG_WMF>	 hashar: thank you for helping me out! 🙏
[13:46:47] <volans>	 !log upgrade spicerack on cumin2002 to 11.3.0
[13:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:29] <logmsgbot>	 !log klausman@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw
[13:48:56] <logmsgbot>	 !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet
[13:49:50] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1008.eqiad.wmnet']
[13:49:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[13:51:11] <logmsgbot>	 !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1002.eqiad.wmnet
[13:51:25] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167856|fix(StructuredTask): wrong order in resolving a deferred]] (duration: 11m 10s)
[13:52:36] <hashar>	 MichaelG_WMF: the deploy is fully complete :]
[13:52:38] <hashar>	 jouncebot: now
[13:52:38] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300)
[13:52:47] <hashar>	 !log UTC afternoon backport window completed
[13:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:49] <MichaelG_WMF>	 hashar: cool, thanks!
[13:53:01] <hashar>	 _joe_: the backport window has completed :)
[13:53:31] <wikibugs>	 (03PS1) 10Elukey: profile::docker::reporter: add DSE and AUX clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167868 (https://phabricator.wikimedia.org/T397696)
[13:53:41] <logmsgbot>	 andrew@cumin1003 upgrade-firmware (PID 1134872) is awaiting input
[13:54:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167868 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[13:55:47] <hnowlan>	 jouncebot: nowandnext
[13:55:47] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1300)
[13:55:47] <jouncebot>	 In 0 hour(s) and 34 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1430)
[13:56:20] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add DSE and AUX clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167868 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[13:56:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78882 and previous config saved to /var/cache/conftool/dbconfig/20250710-135656-root.json
[13:57:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10992160 (10Jclark-ctr) >>! In T394333#10990367, @elukey wrote: > @Jclark-ctr IIUC it was a temporary failure right?  yes that wa...
[13:58:01] <wikibugs>	 (03PS11) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003)
[13:58:14] <logmsgbot>	 !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet
[13:58:42] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10992166 (10STran)
[14:00:05] <logmsgbot>	 !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1002.eqiad.wmnet
[14:01:47] <logmsgbot>	 !log volans@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet
[14:02:58] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[14:03:00] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[14:03:16] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[14:03:20] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[14:04:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10992180 (10Jclark-ctr) ml-serve1015 is now racked into E 12  and added to netbox   @elukey Let me know when you’re finished with any testing you want to do. I’ll stay...
[14:04:57] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1008.eqiad.wmnet']
[14:08:50] <wikibugs>	 (03CR) 10JHathaway: "thanks volans, fixes applied" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[14:10:39] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3486 MB (3% inode=98%): /tmp 3486 MB (3% inode=98%): /var/tmp 3486 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[14:11:16] <wikibugs>	 (03PS3) 10Ssingh: P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114)
[14:11:27] <wikibugs>	 (03CR) 10Ssingh: "rebased, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh)
[14:12:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78883 and previous config saved to /var/cache/conftool/dbconfig/20250710-141202-root.json
[14:12:36] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[14:15:21] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[14:15:42] <wikibugs>	 (03PS1) 10Jelto: aptrepo: add gitlab package for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1167871 (https://phabricator.wikimedia.org/T384595)
[14:16:24] <wikibugs>	 (03CR) 10Btullis: [C:03+2] analytics: deprioritize druid MapReduce jobs if needed [puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo)
[14:16:26] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bookworm
[14:17:52] <wikibugs>	 (03PS2) 10Btullis: analytics: Absent rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo)
[14:17:55] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo)
[14:18:03] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6235/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh)
[14:20:04] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[14:21:10] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804) (owner: 10Arnaudb)
[14:22:09] <wikibugs>	 (03PS10) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317
[14:23:00] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: enable gerrit.service and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167838 (https://phabricator.wikimedia.org/T372804) (owner: 10Arnaudb)
[14:24:24] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[14:26:40] <wikibugs>	 (03PS1) 10Elukey: preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044)
[14:27:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78884 and previous config saved to /var/cache/conftool/dbconfig/20250710-142707-root.json
[14:29:20] <wikibugs>	 (03PS11) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317
[14:29:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[14:30:04] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1430)
[14:30:15] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[14:30:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[14:30:46] <wikibugs>	 (03CR) 10Jelto: "one question in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb)
[14:31:00] <wikibugs>	 (03PS2) 10Elukey: preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044)
[14:31:01] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[14:31:46] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[14:32:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] analytics: Absent rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1167224 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo)
[14:33:10] <wikibugs>	 (03PS3) 10Elukey: preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044)
[14:33:18] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:33:35] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[14:33:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[14:34:08] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:34:37] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage
[14:34:57] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM (requires spicerack to be released to all host before merging it)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[14:35:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1167871 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto)
[14:37:10] <wikibugs>	 (03CR) 10Elukey: [C:03+2] preseed: update sretest2006's config [puppet] - 10https://gerrit.wikimedia.org/r/1167873 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[14:38:23] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage
[14:40:33] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková)
[14:41:44] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: no reason specified, no task ID specified]
[14:41:47] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: no reason specified, no task ID specified]
[14:45:20] <wikibugs>	 (03PS1) 10Tiziano Fogli: nrpe::mon_srv: propagate NRPE migration_task to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/1167876 (https://phabricator.wikimedia.org/T359443)
[14:54:13] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[14:54:33] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[14:56:16] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bookworm
[14:56:33] <wikibugs>	 (03PS1) 10Btullis: Set 52 Hadoop nodes into decommissioning state [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160)
[14:59:32] <wikibugs>	 (03PS2) 10Btullis: Set 52 Hadoop nodes into decommissioning state [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160)
[15:00:05] <jouncebot>	 andre and jnuche: Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1500). Please do the needful.
[15:00:20] <andre>	 jouncebot: I don't think there's much to triage
[15:00:40] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6236/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis)
[15:00:45] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[15:01:56] <wikibugs>	 (03PS1) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926)
[15:02:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) (owner: 10Daimona Eaytoy)
[15:03:28] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[15:05:02] <wikibugs>	 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10992532 (10KartikMistry) Update: We've now staging server running using S3 model storage and observing logs...
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] nrpe::mon_srv: propagate NRPE migration_task to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/1167876 (https://phabricator.wikimedia.org/T359443) (owner: 10Tiziano Fogli)
[15:09:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: "+1 to what Jelto said" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb)
[15:11:06] <logmsgbot>	 !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on gerrit2003.wikimedia.org with reason: maintenance
[15:11:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10992557 (10elukey) relevant conversation from IRC:  ` <moritzm> I think the root problem is that on sretest2006 /var/lib/partman/devices is empty, it's the file wh...
[15:13:35] <wikibugs>	 (03PS2) 10Hnowlan: wikimedia: add CNAMEs for hcaptcha domains [dns] - 10https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841)
[15:13:50] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] nrpe::mon_srv: propagate NRPE migration_task to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/1167876 (https://phabricator.wikimedia.org/T359443) (owner: 10Tiziano Fogli)
[15:14:38] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[15:15:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:15:08] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "It hurts a bit to see 52 of my friends leave, but LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis)
[15:15:12] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:15:46] <wikibugs>	 (03PS1) 10JHathaway: reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - 10https://gerrit.wikimedia.org/r/1167883
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:58] <wikibugs>	 (03PS2) 10JHathaway: reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - 10https://gerrit.wikimedia.org/r/1167883
[15:17:56] <wikibugs>	 (03PS1) 10Elukey: profile::docker::reporter: add Wikikube and ML serve prod clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696)
[15:18:10] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[15:18:44] <logmsgbot>	 elukey@cumin2002 reimage (PID 296806) is awaiting input
[15:20:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10992587 (10Dzahn) Gotcha, Tobi. Yea, seems no problem to do both in this ticket.
[15:20:28] <wikibugs>	 (03PS2) 10Elukey: profile::docker::reporter: add Wikikube and ML serve prod clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696)
[15:20:57] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:21:06] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[15:21:55] <wikibugs>	 (03PS3) 10Hnowlan: wikimedia: add CNAMEs for hcaptcha domains [dns] - 10https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841)
[15:22:04] <wikibugs>	 (03PS3) 10Cwhite: logstash: move grafana status_code field to the right place [puppet] - 10https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565)
[15:22:30] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[15:23:44] <wikibugs>	 (03PS3) 10JHathaway: reimage: use ipxe DHCP info, skip d-i DHCP [puppet] - 10https://gerrit.wikimedia.org/r/1167883
[15:23:53] <wikibugs>	 (03CR) 10Dzahn: "Will https monitoring actually work given that we currently get the "Forbidden" on https://gerrit-replica.wikimedia.org/r/monitoring  ?" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb)
[15:24:00] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] wikimedia: add CNAMEs for hcaptcha domains [dns] - 10https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:25:28] <volans>	 !log upgrade spicerack to 11.3.0 on cumin100[2-3]
[15:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:53] <wikibugs>	 (03PS1) 10Volans: I/F: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167886
[15:25:53] <wikibugs>	 (03PS1) 10Volans: o11y: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167887
[15:25:53] <wikibugs>	 (03PS1) 10Volans: ServiceOps: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888
[15:25:54] <wikibugs>	 (03PS1) 10Volans: Collab: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167889
[15:28:15] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] wikimedia: add CNAMEs for hcaptcha domains [dns] - 10https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:29:40] <logmsgbot>	 !log hnowlan@dns1004 START - running authdns-update
[15:30:39] <logmsgbot>	 !log hnowlan@dns1004 END - running authdns-update
[15:31:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: move grafana status_code field to the right place [puppet] - 10https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:32:55] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[15:33:58] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver, cache: add config for edge routing of hcaptcha [puppet] - 10https://gerrit.wikimedia.org/r/1167670 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:34:27] <wikibugs>	 (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:34:57] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T397926)
[15:35:21] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis mediawikiwiki, testwiki in section s3
[15:35:28] <wikibugs>	 (03CR) 10Volans: "More behavior options available in the commit message." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167887 (owner: 10Volans)
[15:36:10] <wikibugs>	 (03CR) 10Volans: "More behavior options available in the commit message." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167886 (owner: 10Volans)
[15:36:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T397926) (owner: 10Daimona Eaytoy)
[15:37:22] <wikibugs>	 (03CR) 10Volans: "More behavior options available in the commit message." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans)
[15:37:48] <wikibugs>	 (03CR) 10Volans: "More behavior options available in the commit message." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167889 (owner: 10Volans)
[15:38:51] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: move grafana status_code field to the right place [puppet] - 10https://gerrit.wikimedia.org/r/1164524 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:40:08] <cwhite>	 hnowlan: is the hcaptcha change ready for deploy?
[15:40:49] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[15:40:50] <hnowlan>	 cwhite: yes, please! 
[15:40:59] <hnowlan>	 if you're already merging
[15:41:05] <cwhite>	 will do :)
[15:44:09] <cwhite>	 {done}
[15:44:34] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167883 (owner: 10JHathaway)
[15:44:34] <wikibugs>	 (03PS3) 10Cwhite: logstash: flatten array of objects in stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/1164525 (https://phabricator.wikimedia.org/T234565)
[15:49:25] <logmsgbot>	 fceratto@cumin1002 sanitize-wiki (PID 1333657) is awaiting input
[15:49:41] <wikibugs>	 (03PS1) 10Hnowlan: wikimedia: simplify hcaptcha subsubdomains [dns] - 10https://gerrit.wikimedia.org/r/1167891 (https://phabricator.wikimedia.org/T397841)
[15:49:55] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10992657 (10jhathaway) would it be possible to setup a separate vrts server, that is configured with postfix, rather than replacing exim...
[15:50:15] <wikibugs>	 (03PS7) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910)
[15:51:00] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wikimedia: simplify hcaptcha subsubdomains [dns] - 10https://gerrit.wikimedia.org/r/1167891 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:52:13] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[15:52:53] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM, one question hits me but I think the logic works.  Also if we do the conditional the way dcaro suggests is fine no preference on my " [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott)
[15:54:25] <xcollazo>	 !log refreshed YARN queues definition in production via https://phabricator.wikimedia.org/T399013#10992686
[15:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:12] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver, profile::hcaptcha: simplify subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841)
[15:56:26] <wikibugs>	 (03PS8) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910)
[15:56:51] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] wikimedia: simplify hcaptcha subsubdomains [dns] - 10https://gerrit.wikimedia.org/r/1167891 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:57:10] <logmsgbot>	 !log hnowlan@dns1004 START - running authdns-update
[15:58:01] <logmsgbot>	 !log hnowlan@dns1004 END - running authdns-update
[15:58:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver, profile::hcaptcha: simplify subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:58:58] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] trafficserver, profile::hcaptcha: simplify subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[15:59:46] <wikibugs>	 (03PS1) 10Máté Szabó: Configure Special:CreateAccount instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744)
[16:00:05] <jouncebot>	 jhathaway and moritzm: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1600)
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:31] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott)
[16:00:53] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Set 52 Hadoop nodes into decommissioning state [puppet] - 10https://gerrit.wikimedia.org/r/1167878 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis)
[16:02:38] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver, profile::hcaptcha: simplify subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1167893 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[16:04:13] <wikibugs>	 (03PS1) 10Volans: Data Persistence: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898
[16:05:04] <wikibugs>	 (03CR) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott)
[16:05:38] <wikibugs>	 (03CR) 10JHathaway: "tested a UUID reimage successfully for sretest2001, in combination with 1167883" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway)
[16:05:56] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10992727 (10Jhancock.wm) @elukey i got 2044 pingable. i set a few things on this one, including the password, in the idrac. i also got 2045 pingable. on this one i only...
[16:07:15] <wikibugs>	 (03CR) 10Volans: "More behavior options available in the commit message." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans)
[16:09:08] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10992745 (10SKivlehan-WMF) 05In progress→03Resolved I'm in! Marking as Resolved, thank you all for the assistance here.
[16:10:52] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "Good plan." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T397926) (owner: 10Daimona Eaytoy)
[16:11:18] <wikibugs>	 (03CR) 10Jforrester: "00:00:19.370 1) InitialiseSettingsTest::testMustHaveConfigs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) (owner: 10Daimona Eaytoy)
[16:11:50] <wikibugs>	 (03PS1) 10FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212)
[16:12:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri)
[16:13:20] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott)
[16:14:50] <wikibugs>	 (03PS1) 10Federico Ceratto: sanitize-wiki: Support sections other than s5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178)
[16:14:50] <wikibugs>	 (03CR) 10Federico Ceratto: "Allows setting sections other than s5" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178) (owner: 10Federico Ceratto)
[16:15:18] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott)
[16:15:24] <wikibugs>	 (03PS2) 10FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212)
[16:18:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri)
[16:20:43] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:21:51] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:22:07] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:22:43] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:23:34] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:24:40] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:26:14] <wikibugs>	 (03PS2) 10Jforrester: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T397926) (owner: 10Daimona Eaytoy)
[16:26:14] <wikibugs>	 (03PS1) 10Jforrester: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T397926)
[16:39:55] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: flatten array of objects in stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/1164525 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[16:42:49] <wikibugs>	 (03PS9) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910)
[16:42:49] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudceph osd.yaml: update some nic names for Bookworm reimages [puppet] - 10https://gerrit.wikimedia.org/r/1167905
[16:46:16] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167905 (owner: 10Andrew Bogott)
[16:46:32] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM.  I confirmed they are the names the system is now using." [puppet] - 10https://gerrit.wikimedia.org/r/1167905 (owner: 10Andrew Bogott)
[16:50:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudceph osd.yaml: update some nic names for Bookworm reimages [puppet] - 10https://gerrit.wikimedia.org/r/1167905 (owner: 10Andrew Bogott)
[16:50:43] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis mediawikiwiki, testwiki in section s3
[16:54:26] <wikibugs>	 (03PS2) 10BryanDavis: puppetserver: check for rebase in puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877)
[16:55:47] <wikibugs>	 (03CR) 10BryanDavis: puppetserver: check for rebase in puppetserver-deploy-code (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: 10BryanDavis)
[17:00:05] <jouncebot>	 bd808: #bothumor I ♥ Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T1700)
[17:01:00] <bd808>	 Nothing to push out in my window this week
[17:04:34] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sanitize-wiki: Support sections other than s5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178) (owner: 10Federico Ceratto)
[17:05:20] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:07:10] <wikibugs>	 (03PS5) 10BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[17:09:58] <wikibugs>	 (03PS6) 10BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[17:10:32] <wikibugs>	 (03CR) 10Herron: [C:03+1] "Note: We will need to manually clean the old pyrra configs that will be orphaned by this change" [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey)
[17:12:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78886 and previous config saved to /var/cache/conftool/dbconfig/20250710-171214-root.json
[17:14:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:19:04] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] puppetserver: check for rebase in puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: 10BryanDavis)
[17:19:19] <wikibugs>	 (03CR) 10Federico Ceratto: "LGTM, the small change in `--task` vs `--task-id` should not be an issue, also afaik Manuel tends to use `-t` anyways." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans)
[17:19:47] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] Data Persistence: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans)
[17:21:43] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C:03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T397926) (owner: 10Jforrester)
[17:21:59] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1049.eqiad.wmnet']
[17:22:17] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1049.eqiad.wmnet']
[17:25:23] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1049.eqiad.wmnet
[17:25:34] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1049.eqiad.wmnet
[17:27:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78887 and previous config saved to /var/cache/conftool/dbconfig/20250710-172719-root.json
[17:28:25] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: 10Jforrester)
[17:28:40] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:28:42] <wikibugs>	 (03PS3) 10Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: 10Jforrester)
[17:29:03] <wikibugs>	 (03PS4) 10Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: 10Jforrester)
[17:29:30] <wikibugs>	 (03PS3) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[17:29:43] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549)
[17:30:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy)
[17:33:03] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:33:29] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910
[17:35:00] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Given the incident today and in general, I will just merge this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh)
[17:39:40] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910
[17:39:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy)
[17:42:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78889 and previous config saved to /var/cache/conftool/dbconfig/20250710-174225-root.json
[17:42:32] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:47:56] <wikibugs>	 (03CR) 10BryanDavis: [V:03+1] "Seems to be working as hoped to power k8s-api.svc.zuul.eqiad1.wikimedia.cloud:" [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936) (owner: 10BryanDavis)
[17:54:16] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10993075 (10cmooney) We may need to hold off on this for now.  The requirement for jumbo frames poses a difficulty for the plan as the parent i...
[17:54:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:55:28] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:55:43] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:56:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10993082 (10cmooney)
[17:56:22] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[17:57:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78890 and previous config saved to /var/cache/conftool/dbconfig/20250710-175730-root.json
[17:59:15] <wikibugs>	 (03PS4) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[18:00:45] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] aptrepo: add gitlab package for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1167871 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto)
[18:00:46] <wikibugs>	 (03PS3) 10Daimona Eaytoy: Add a test to verify that "normal" DBLists do not contain private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549)
[18:03:31] <wikibugs>	 (03PS4) 10Daimona Eaytoy: Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549)
[18:05:51] <wikibugs>	 (03CR) 10Daimona Eaytoy: "I made Iab79188f72664247d for another setting that can be migrated. I missed fishbowl wikis when originally writing that, which is why I g" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[18:15:52] <wikibugs>	 (03PS3) 10Cwhite: logstash: use filter_on_templates_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1164526 (https://phabricator.wikimedia.org/T234565)
[18:18:54] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[18:19:28] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:26:15] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10993155 (10Jhancock.wm)
[18:28:07] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:28:23] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[18:36:32] <wikibugs>	 (03PS10) 10Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[18:36:32] <wikibugs>	 (03PS3) 10Jforrester: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy)
[18:37:02] <wikibugs>	 (03CR) 10Jforrester: "PS10: Manual rebase. Let's land this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[18:38:14] <wikibugs>	 (03CR) 10Daimona Eaytoy: "Thank you. Config diff LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy)
[18:38:31] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1035: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167914 (https://phabricator.wikimedia.org/T396651)
[18:38:33] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1036: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167915 (https://phabricator.wikimedia.org/T396651)
[18:38:38] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1037: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167916 (https://phabricator.wikimedia.org/T396651)
[18:38:41] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1038: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167917 (https://phabricator.wikimedia.org/T396651)
[18:38:43] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1039: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167918 (https://phabricator.wikimedia.org/T396651)
[18:38:44] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1040: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167919 (https://phabricator.wikimedia.org/T396651)
[18:38:46] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd1041: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167920 (https://phabricator.wikimedia.org/T396651)
[18:39:26] <sukhe>	 !log sukhe@cp5017:~$ sudo systemctl stop trafficserver.service && sudo traffic_server -C clear_cache && sudo systemctl start trafficserver.service
[18:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:30] <sukhe>	 !log sukhe@cp5017:~$ sudo systemctl stop trafficserver.service && sudo traffic_server -C clear_cache && sudo systemctl start trafficserver.service: T399221
[18:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:34] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[18:39:47] <sukhe>	 !log clearing varnish and ATS cache on cp5017 before repooling eqsin: T399221
[18:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1035: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167914 (https://phabricator.wikimedia.org/T396651) (owner: 10Andrew Bogott)
[18:42:31] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:42:45] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:06] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:18] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:46] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:43:58] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:18] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:29] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:45] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:44:53] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:47:09] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: arelion drained; traffic is going through ulsfo to codfw, T399221]
[18:47:13] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[18:47:20] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: arelion drained; traffic is going through ulsfo to codfw, T399221]
[18:48:58] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:49:08] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:49:58] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:50:08] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:50:17] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:50:45] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:50:54] <logmsgbot>	 !log andrew@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:55:21] <wikibugs>	 (03PS6) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833)
[18:55:52] <wikibugs>	 (03CR) 10Dzahn: "amended to change "passive host" to "replica host"" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn)
[18:58:35] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:58:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[18:58:46] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[18:59:31] <logmsgbot>	 !log andrew@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[19:00:02] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035.eqiad.wmnet']
[19:00:12] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1035.eqiad.wmnet']
[19:00:18] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[19:00:28] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1006.eqiad.wmnet']
[19:01:19] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[19:01:28] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[19:01:30] <wikibugs>	 (03CR) 10Jforrester: "OK, final(?) review:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[19:02:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[19:02:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10993245 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eq...
[19:04:31] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993263 (10Dzahn) Hosts are not virtual, they are physical machines. So the biggest issue with that would be where to get hardware from...
[19:05:10] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[19:05:21] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[19:07:39] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[19:07:47] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[19:10:29] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: "SSD firmware fetch from DELL website not yet implemented" - https://phabricator.wikimedia.org/T399234 (10Andrew) 03NEW
[19:15:12] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:19:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[19:22:09] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[19:24:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[19:28:23] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[19:46:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[19:46:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10993386 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad....
[19:50:57] <wikibugs>	 (03CR) 10BryanDavis: [V:03+1] "LGTM. I'm still not quite sure I understand why the flags are changing for Beta's votewiki, but we can chase that more if anyone ever find" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[19:53:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[19:55:04] <wikibugs>	 (03PS1) 10Cwhite: logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565)
[19:57:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[19:59:05] <wikibugs>	 (03PS2) 10Cwhite: logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T2000).
[20:00:04] <jouncebot>	 James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:08:50] <wikibugs>	 (03PS3) 10LD: wmf-config/core-Permissions.php: sort keys alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167927
[20:09:51] <wikibugs>	 (03CR) 10LD: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167927 (owner: 10LD)
[20:12:00] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: fix gitlab event field type conflict [puppet] - 10https://gerrit.wikimedia.org/r/1167926 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[20:12:44] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@c558ea4]: Artifactct analytics-test
[20:12:57] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@c558ea4]: Artifactct analytics-test (duration: 00m 13s)
[20:13:37] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics@c558ea4]: Artifactct analytics / main
[20:14:20] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@c558ea4]: Artifactct analytics / main (duration: 00m 43s)
[20:17:24] <wikibugs>	 (03CR) 10LD: "JSON key order doesn't affect behavior, so Jenkins may not detect this as a meaningful change, but the keys were reordered alphabetically " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167927 (owner: 10LD)
[20:21:54] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:24:43] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:24:54] <logmsgbot>	 !log root@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:25:15] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1035.eqiad.wmnet
[20:25:16] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:25:19] <logmsgbot>	 !log root@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:30:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:30:26] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:31:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[20:35:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:35:26] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:36:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[20:37:34] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993532 (10Arnoldokoth) @Dzahn Or we could repurpose a spare server (if available)? `miscweb` comes to mind... Or were those VMs?
[20:39:26] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1035.eqiad.wmnet
[20:39:29] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:40:13] <wikibugs>	 (03PS1) 10Ahmon Dancy: logspam.pl: Avoid consolidation of wrapped error message [puppet] - 10https://gerrit.wikimedia.org/r/1167932 (https://phabricator.wikimedia.org/T399239)
[20:40:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993548 (10RobH)
[20:40:55] <James_F>	 Argh, finally back online.
[20:41:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[20:41:57] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:42:10] <wikibugs>	 (03Merged) 10jenkins-bot: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis)
[20:42:17] <bd808>	 thanks for keeping that patch alive James_F :)
[20:42:21] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1137480|Use `sul` dblist in InitialiseSettings]]
[20:42:25] <James_F>	 bd808: Thank you for working on it!
[20:42:27] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cloudcephosd1035.eqiad.wmnet
[20:42:46] <James_F>	 Testing it on debug will be fun. Which of ~2000 settings on ~1000 wikis still work?
[20:43:27] <bd808>	 #someday we will have "user journey tests". #someday
[20:43:51] * James_F has a bridge to sell you.
[20:44:11] <James_F>	 TBF, for Wikifunctions we do indeed have our Critical User Journeys with matching browser tests for each.
[20:44:17] <James_F>	 So it is possible. :-)
[20:44:22] <logmsgbot>	 !log jforrester@deploy1003 jforrester, bd808: Backport for [[gerrit:1137480|Use `sul` dblist in InitialiseSettings]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:44:26] <wikibugs>	 (03Abandoned) 10Ahmon Dancy: logspam: Consolidate several more persistent log messages [puppet] - 10https://gerrit.wikimedia.org/r/1056232 (owner: 10Ahmon Dancy)
[20:44:33] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bookworm
[20:44:39] <bd808>	 step 1: get it in the APP. step 2: ???. step 3: PROFIT!
[20:44:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1035.eqia...
[20:48:37] <James_F>	 OK, let's do it.
[20:48:39] <logmsgbot>	 !log jforrester@deploy1003 jforrester, bd808: Continuing with sync
[20:54:04] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137480|Use `sul` dblist in InitialiseSettings]] (duration: 11m 43s)
[20:55:00] <bd808>	 enwiki still shows the main_page. things must be fine! :)
[20:55:23] <James_F>	 WCPGW?!
[20:55:25] <James_F>	 Yeah.
[20:56:08] <wikibugs>	 (03CR) 10Jforrester: "Of course! I didn't want to deploy this alongside the parent, but I think this is now good to land." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy)
[20:57:12] <wikibugs>	 (03CR) 10Brennen Bearnes: [C:03+1] "Tested on mwlog1002; LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1167932 (https://phabricator.wikimedia.org/T399239) (owner: 10Ahmon Dancy)
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T2100)
[21:06:58] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[21:10:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:12:51] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[21:22:15] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993697 (10Dzahn) @Arnoldokoth If there is a spare server, sure, but I am not sure there is one. Back in the days dcops had a spare poo...
[21:26:44] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10993701 (10Dzahn) Well... or we could create a VM and try to install VRTS with postfix on that. If that works (where I'm not sure how t...
[21:30:39] <wikibugs>	 (03PS5) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[21:30:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy)
[21:31:11] <wikibugs>	 (03PS5) 10Daimona Eaytoy: Explicitly set wgServer etc. for private wikis under the 'wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: 10Jforrester)
[21:31:48] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: "SSD firmware fetch from DELL website not yet implemented" - https://phabricator.wikimedia.org/T399234#10993706 (10RobH) 05Open→03Resolved a:03RobH IRC Update:  The file it was looking for didn't exist on the cumin1003 host, but does on cumin20...
[21:32:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bookworm
[21:32:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1035.eqiad.wm...
[21:36:47] <wikibugs>	 (03PS6) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[21:47:09] <wikibugs>	 (03PS7) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[21:47:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "nice:)" [puppet] - 10https://gerrit.wikimedia.org/r/1167823 (https://phabricator.wikimedia.org/T392127) (owner: 10Hashar)
[21:47:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy)
[21:49:01] <wikibugs>	 (03Abandoned) 10Andrew Bogott: cloudcephosd1035: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167914 (https://phabricator.wikimedia.org/T396651) (owner: 10Andrew Bogott)
[21:49:30] <wikibugs>	 (03PS4) 10Cwhite: logstash: use filter_on_templates_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1164526 (https://phabricator.wikimedia.org/T234565)
[21:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:55:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:55:49] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1036.eqiad.wmnet
[21:55:49] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: use filter_on_templates_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1164526 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[21:57:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:59:31] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Add phan and use it to detect duplicated array keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941
[22:00:02] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1036.eqiad.wmnet
[22:00:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add phan and use it to detect duplicated array keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy)
[22:05:39] <wikibugs>	 (03PS8) 10Daimona Eaytoy: [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549)
[22:06:30] <wikibugs>	 (03CR) 10Daimona Eaytoy: "Config diff review:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy)
[22:06:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy)
[22:10:42] <wikibugs>	 (03PS1) 10Cwhite: logstash: remove filter_on_templates v1 [puppet] - 10https://gerrit.wikimedia.org/r/1167942 (https://phabricator.wikimedia.org/T234565)
[22:10:44] <wikibugs>	 (03PS1) 10Cwhite: logstash: rename filter-on-templates.rb [puppet] - 10https://gerrit.wikimedia.org/r/1167943 (https://phabricator.wikimedia.org/T234565)
[22:13:06] <wikibugs>	 (03PS1) 10Zabe: Fix categorylinks read new query for excluded categories [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167944 (https://phabricator.wikimedia.org/T385890)
[22:13:35] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1036.eqiad.wmnet
[22:13:39] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1036.eqiad.wmnet
[22:16:39] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bookworm
[22:16:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993757 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1036.eqia...
[22:20:57] <zabe>	 jouncebot: nowandnext
[22:20:57] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 39 minute(s)
[22:20:57] <jouncebot>	 In 7 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0600)
[22:21:01] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Fix categorylinks read new query for excluded categories [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167944 (https://phabricator.wikimedia.org/T385890) (owner: 10Zabe)
[22:21:55] <wikibugs>	 (03Merged) 10jenkins-bot: Fix categorylinks read new query for excluded categories [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167944 (https://phabricator.wikimedia.org/T385890) (owner: 10Zabe)
[22:22:57] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167944|Fix categorylinks read new query for excluded categories (T385890)]]
[22:23:02] <stashbot>	 T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[22:24:56] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1167944|Fix categorylinks read new query for excluded categories (T385890)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:25:39] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[22:30:56] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167944|Fix categorylinks read new query for excluded categories (T385890)]] (duration: 07m 59s)
[22:31:00] <stashbot>	 T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[22:39:27] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[22:42:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:43:26] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[22:52:37] <wikibugs>	 (03Abandoned) 10Andrew Bogott: cloudcephosd1036: update nic names for Bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1167915 (https://phabricator.wikimedia.org/T396651) (owner: 10Andrew Bogott)
[23:02:59] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bookworm
[23:03:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1036.eqiad.wm...
[23:03:17] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1037.eqiad.wmnet
[23:08:59] <logmsgbot>	 andrew@cumin2002 upgrade-firmware (PID 535018) is awaiting input
[23:09:06] <wikibugs>	 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#10993801 (10Dzahn) To my surprise it seems like profile::httpd is only included in role::config_master anymore but that's it.
[23:13:53] <wikibugs>	 (03PS1) 10Dzahn: profile::httpd: include prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434)
[23:15:12] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:21:50] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "it is used on far fewer roles anymore than it used be. seems like in prod it's just puppetserver and config-master, where "just" is relati" [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: 10Dzahn)
[23:38:13] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167966
[23:38:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167966 (owner: 10TrainBranchBot)
[23:40:35] <wikibugs>	 06SRE: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10993827 (10Dzahn) Should we just talk to Marko directly and ask if he uses this?   Then it becomes clear if a new NDA should be created or just access removed.  https://www.linkedin.com/in/doorman
[23:44:01] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1037.eqiad.wmnet
[23:49:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[23:50:49] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167966 (owner: 10TrainBranchBot)
[23:57:53] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1037.eqiad.wmnet
[23:57:56] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1037.eqiad.wmnet
[23:59:04] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bookworm
[23:59:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037.eqia...