[00:06:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965826 (Jclark-ctr) @VRiley-WMF   Was a Dell ticket opened for this server? We have two other servers experiencing the same issue, and it has now reoccurred. T383051 T397851  T397829  @Eevans
[00:08:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10965830 (Jclark-ctr) @Clement_Goubert  it has cleared for the time i am still working with dell since this seems to be reoccurring  issues   i...
[00:08:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1165630
[00:08:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1165630 (owner: TrainBranchBot)
[00:09:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10965831 (Jclark-ctr) @Clement_Goubert   it has cleared for the time i am still working with dell since this seems to be reoccurring  issues   if...
[00:31:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1165630 (owner: TrainBranchBot)
[00:32:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[00:32:45] <jinxer-wm>	 FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:37:45] <jinxer-wm>	 FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:46:28] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[00:52:45] <jinxer-wm>	 FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:53:44] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm
[00:57:45] <jinxer-wm>	 RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[02:05:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965935 (Jclark-ctr) @wiki_willy tagging you also for visibility.  @Jhancock.wm  @VRiley-WMF   we should be opening tickets for this error with dell  for these errors here is a quick list of servers...
[02:07:36] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965949 (Jclark-ctr) @MatthewVernon  these are failing puppet do you need to set site.pp for insetup?
[02:07:38] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965950 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1092.eqiad.wmnet with OS bullseye executed...
[02:07:41] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1093.eqiad.wmnet with OS bullseye executed...
[02:12:01] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[02:23:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:28:32] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[02:32:44] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[02:50:02] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[03:11:22] <wikibugs>	 (PS1) EggRoll97: Allow abusefilter block action on plwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137)
[04:31:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:41:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:56:24] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) (owner: Slyngshede)
[05:57:23] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[05:57:26] <stashbot>	 T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300
[05:58:34] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[06:00:04] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0600)
[06:02:57] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[06:03:00] <stashbot>	 T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300
[06:04:19] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[06:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:44] <wikibugs>	 (PS1) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833)
[06:12:16] <wikibugs>	 (CR) Arnaudb: "followed wikitech instructions to prep plugin installation via CI, let me know if anything else is required. I needed to edit the repo's ." [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[06:15:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool pc4 T378715', diff saved to https://phabricator.wikimedia.org/P78735 and previous config saved to /var/cache/conftool/dbconfig/20250702-061517-ladsgroup.json
[06:15:20] <stashbot>	 T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715
[06:23:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:22] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Switch to 10G (T378715)
[06:28:25] <stashbot>	 T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715
[06:29:30] <Amir1>	 !log dropping l10n_cache table everywhere (T397367)
[06:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:32] <stashbot>	 T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[06:31:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:32:39] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:33:29] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:35:33] <wikibugs>	 (PS1) Ayounsi: eqiad/codfw: remove more former TE [homer/public] - https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844)
[06:42:35] <icinga-wm>	 PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[06:42:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:43:25] <icinga-wm>	 RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.008 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[06:48:03] <icinga-wm>	 PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor
[06:49:53] <icinga-wm>	 RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[06:51:33] <wikibugs>	 (PS1) Muehlenhoff: Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767
[06:52:15] <wikibugs>	 (CR) CI reject: [V:-1] Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767 (owner: Muehlenhoff)
[06:52:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:53:58] <wikibugs>	 (PS2) Muehlenhoff: Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767
[06:54:39] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:58:31] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29772 bytes in 0.686 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:00:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0700).
[07:00:05] <jouncebot>	 EggRoll97: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] <EggRoll97>	 o/
[07:11:12] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767 (owner: Muehlenhoff)
[07:12:21] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] centrallog: Log with standard and custom template [puppet] - https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[07:18:22] <EggRoll97>	 Anyone available for deployment? (I'm not sure if I need to ask, but I haven't seen anyone yet)
[07:20:44] <wikibugs>	 (CR) Hashar: [C:-1] "None of our other plugins (ex: go-import, lfs, zuul) are mentioned in `.gitignore`. If I try it I get:" [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:27:03] <Amir1>	 I'm in a conference, can't deploy stuff today :(
[07:27:33] <EggRoll97>	 Darn, thanks for letting me know though. urbanecm: are you available?
[07:29:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:37:32] <wikibugs>	 (PS1) Muehlenhoff: Remove puppetserver1003/puppetserver2004 for maintenance [dns] - https://gerrit.wikimedia.org/r/1165815
[07:38:20] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5007.eqsin.wmnet with reason: reimage
[07:40:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5007.eqsin.wmnet with OS bookworm
[07:40:40] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966248 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5007.eqsin.wmnet with OS bookworm
[07:41:00] <phuedx>	 jouncebot now
[07:41:00] <jouncebot>	 For the next 12 hour(s) and 48 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430)
[07:41:00] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0700)
[07:41:05] <phuedx>	 jouncebot refresh
[07:41:06] <jouncebot>	 I refreshed my knowledge about deployments.
[07:41:11] <phuedx>	 jouncebot now
[07:41:11] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0700)
[07:41:19] <phuedx>	 Better :)
[07:48:32] <wikibugs>	 (CR) MVernon: [C:+1] bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: Jcrespo)
[07:49:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppetserver1003/puppetserver2004 for maintenance [dns] - https://gerrit.wikimedia.org/r/1165815 (owner: Muehlenhoff)
[07:49:39] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[07:50:41] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[07:52:56] <wikibugs>	 (CR) Jcrespo: [C:+2] bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: Jcrespo)
[07:53:36] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] thanos: start sampled traces from query-frontend [puppet] - https://gerrit.wikimedia.org/r/1165493 (https://phabricator.wikimedia.org/T394414) (owner: Filippo Giunchedi)
[07:53:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] thanos: notify services on tracing changes [puppet] - https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: Filippo Giunchedi)
[07:54:05] <wikibugs>	 (PS2) Filippo Giunchedi: thanos: notify services on tracing changes [puppet] - https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414)
[07:54:25] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] thanos: notify services on tracing changes [puppet] - https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: Filippo Giunchedi)
[07:54:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:14] <wikibugs>	 (Abandoned) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:00:05] <jouncebot>	 jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0800).
[08:00:34] <wikibugs>	 (Restored) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:01:25] <jnuche>	 morning, the train will roll out shortly
[08:02:56] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver1003.eqiad.wmnet
[08:03:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage
[08:04:42] <wikibugs>	 (PS2) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833)
[08:06:52] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.45.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165819 (https://phabricator.wikimedia.org/T392178)
[08:06:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165819 (https://phabricator.wikimedia.org/T392178) (owner: TrainBranchBot)
[08:07:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage
[08:07:46] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.45.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165819 (https://phabricator.wikimedia.org/T392178) (owner: TrainBranchBot)
[08:10:28] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1003.eqiad.wmnet
[08:10:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:13:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:14] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.8  refs T392178
[08:16:17] <stashbot>	 T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178
[08:18:52] <wikibugs>	 (PS2) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833)
[08:20:52] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver2004.codfw.wmnet
[08:24:17] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202842s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:27:08] <wikibugs>	 (CR) Ayounsi: "overall lgtm, not easy to do a thorough review." [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:28:44] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2004.codfw.wmnet
[08:30:05] <wikibugs>	 (PS1) Muehlenhoff: Revert "Remove puppetserver1003/puppetserver2004 for maintenance" [dns] - https://gerrit.wikimedia.org/r/1165821
[08:32:05] <wikibugs>	 (CR) Filippo Giunchedi: "Maybe I'm missing something, though I'd expect cleanup to happen on service start so pyrra-filesystem starts with a blank slate." [puppet] - https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:32:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "Remove puppetserver1003/puppetserver2004 for maintenance" [dns] - https://gerrit.wikimedia.org/r/1165821 (owner: Muehlenhoff)
[08:33:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5007.eqsin.wmnet with OS bookworm
[08:33:09] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[08:33:15] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966375 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5007.eqsin.wmnet with OS bookworm completed: - ganeti5007 (**PASS*...
[08:33:27] <wikibugs>	 (CR) Filippo Giunchedi: pyrra-filesystem: clear output file on service stop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:34:12] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[08:34:57] <wikibugs>	 (CR) Cathal Mooney: "thanks for the review, few replies in line I will submit another patch later with those few updates." [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:43:11] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10966425 (cmooney) >>! In T396396#10955048, @Andrew wrote: >>>! In T396396#10954940, @cmooney wrote: >> Folks you need to delete th...
[08:43:23] <wikibugs>	 (PS1) Jcrespo: bacula: Remove backup1001 old backup director host from puppet [puppet] - https://gerrit.wikimedia.org/r/1165822 (https://phabricator.wikimedia.org/T387892)
[08:43:26] <wikibugs>	 (PS1) Jcrespo: bacula: Remove backup2001, old offsite backup host [puppet] - https://gerrit.wikimedia.org/r/1165823 (https://phabricator.wikimedia.org/T387892)
[08:43:38] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[08:43:38] <wikibugs>	 (PS1) Jelto: miscweb: bump another three miscweb images to bookworm [deployment-charts] - https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303)
[08:46:45] <wikibugs>	 (CR) JMeybohm: [C:+1] pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[08:47:33] <wikibugs>	 (CR) JMeybohm: [C:+2] sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:47:35] <wikibugs>	 (CR) JMeybohm: [C:+2] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:47:38] <wikibugs>	 (CR) JMeybohm: [C:+2] k8s.pool-depool-cluster: Black format [cookbooks] - https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:47:42] <wikibugs>	 (CR) Volans: [C:+2] Upstream release v0.6.4 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165006 (owner: Volans)
[08:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[08:53:19] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[08:53:21] <wikibugs>	 (Merged) jenkins-bot: k8s.pool-depool-cluster: Black format [cookbooks] - https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:54:39] <wikibugs>	 (Merged) jenkins-bot: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:55:19] <wikibugs>	 (PS2) Volans: debmonitor: add link to docker-registry [puppet] - https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696)
[08:55:20] <wikibugs>	 (PS3) Volans: debmonitor: use the new endpoint for the check [puppet] - https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696)
[08:55:20] <wikibugs>	 (PS1) Volans: debmonitor: fix debmonitor_servers hieradata [puppet] - https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696)
[08:56:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5007.eqsin.wmnet to cluster eqsin and group 1
[08:57:14] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[08:58:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5007.eqsin.wmnet to cluster eqsin and group 1
[08:59:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:45] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966502 (MoritzMuehlenhoff)
[08:59:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966505 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done!
[09:00:05] <wikibugs>	 (Merged) jenkins-bot: sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[09:00:06] <wikibugs>	 (Merged) jenkins-bot: Upstream release v0.6.4 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165006 (owner: Volans)
[09:00:35] <icinga-wm>	 PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:01:01] <vgutierrez>	 ^^ expected?
[09:01:25] <icinga-wm>	 RECOVERY - puppetboard.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:01:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#10966513 (MoritzMuehlenhoff)
[09:01:51] <moritzm>	 !log rebalance ganeti/eqsin following Bookworm reimages
[09:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3006.esams.wmnet to cluster esams02 and group BW27
[09:04:02] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3006.esams.wmnet to cluster esams02 and group BW27
[09:04:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412 (10cmooney) 03NEW p:05Triage→03Low
[09:04:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet
[09:05:14] <wikibugs>	 (03CR) 10Ayounsi: Switch BGP: Automate & unify IBGP configs on switches (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[09:05:24] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[09:05:25] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#10966569 (10ops-monitoring-bot) Draining ganeti3006.esams.wmnet of running VMs
[09:06:23] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[09:06:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet
[09:06:58] <volans>	 !log uploaded debmonitor-server,python3-debmonitor_0.6.4 to apt.wikimedia.org bookworm-wikimedia
[09:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:07] <jynus>	 there is an error in the bacula config, I am debugging now
[09:07:27] <icinga-wm>	 PROBLEM - bacula director process on backup1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:07:36] <jynus>	 ^ this is the error
[09:08:21] <zabe>	 jouncebot: nowandnext
[09:08:21] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0800)
[09:08:21] <jouncebot>	 In 0 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1000)
[09:08:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:09:43] <wikibugs>	 (03CR) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[09:10:18] <zabe>	 jnuche: do you currently need the window for the train or may I do a backport?
[09:10:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:11:09] <jnuche>	 zabe: train is stable right now, please go ahead :)
[09:11:18] <zabe>	 thanks:)
[09:11:34] <wikibugs>	 (03PS1) 10Zabe: Fix categorylinks join order and use index on correct table [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165827 (https://phabricator.wikimedia.org/T398380)
[09:11:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:11:44] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Fix categorylinks join order and use index on correct table [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165827 (https://phabricator.wikimedia.org/T398380) (owner: 10Zabe)
[09:12:13] <jynus>	 found the issue with bacula director, a leftover from an old host
[09:12:17] <jynus>	 sending patch
[09:13:17] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10966613 (10MoritzMuehlenhoff)
[09:14:28] <wikibugs>	 (03Merged) 10jenkins-bot: Fix categorylinks join order and use index on correct table [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165827 (https://phabricator.wikimedia.org/T398380) (owner: 10Zabe)
[09:15:26] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Remove reference to old backup pool, now removed [puppet] - 10https://gerrit.wikimedia.org/r/1165828 (https://phabricator.wikimedia.org/T398188)
[09:15:27] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165827|Fix categorylinks join order and use index on correct table (T398380)]]
[09:15:31] <stashbot>	 T398380: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT  page_id,page_namespace,page_title,page_is_redirect,page_len, - https://phabricator.wikimedia.org/T398380
[09:15:38] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Remove reference to old backup pool, now removed [puppet] - 10https://gerrit.wikimedia.org/r/1165828 (https://phabricator.wikimedia.org/T398188)
[09:16:52] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Remove reference to old backup pool, now removed [puppet] - 10https://gerrit.wikimedia.org/r/1165828 (https://phabricator.wikimedia.org/T398188) (owner: 10Jcrespo)
[09:17:31] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch lvs4010 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561)
[09:17:39] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1165827|Fix categorylinks join order and use index on correct table (T398380)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:18:23] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[09:18:52] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:19:17] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165563 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:19:48] <wikibugs>	 (03PS1) 10Zabe: Reapply "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165831 (https://phabricator.wikimedia.org/T397912)
[09:20:02] <jynus>	 bacula should be healthy now
[09:20:15] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Consolidate katran config for magru [puppet] - 10https://gerrit.wikimedia.org/r/1165563 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:20:27] <icinga-wm>	 RECOVERY - bacula director process on backup1014 is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:23:53] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165827|Fix categorylinks join order and use index on correct table (T398380)]] (duration: 08m 26s)
[09:23:56] <stashbot>	 T398380: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT  page_id,page_namespace,page_title,page_is_redirect,page_len, - https://phabricator.wikimedia.org/T398380
[09:24:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:25] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Reapply "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165831 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[09:24:28] <wikibugs>	 (03PS1) 10Hashar: gerrit: config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693)
[09:24:37] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[09:24:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[09:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: Reapply "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165831 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[09:25:45] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165831|Reapply "categorylinks: Set group0 to read new" (T397912)]]
[09:25:48] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[09:26:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:27:53] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1165831|Reapply "categorylinks: Set group0 to read new" (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:28:38] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[09:29:18] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1093.eqiad.wmnet with OS bullseye
[09:29:26] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966662 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1093.eqiad.wmnet with OS bullseye
[09:30:03] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1092.eqiad.wmnet with OS bullseye
[09:30:11] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1092.eqiad.wmnet with OS bullseye
[09:31:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:35:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet
[09:36:01] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165831|Reapply "categorylinks: Set group0 to read new" (T397912)]] (duration: 10m 15s)
[09:36:03] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[09:36:14] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10966680 (10ops-monitoring-bot) Draining ganeti6004.drmrs.wmnet of running VMs
[09:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:36:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet
[09:37:44] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts backup2001.codfw.wmnet
[09:38:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10966696 (10ayounsi)
[09:39:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10966702 (10ayounsi) option 2 lgtm!
[09:39:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to plain
[09:40:28] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Remove backup2001, old offsite backup host [puppet] - 10https://gerrit.wikimedia.org/r/1165823 (https://phabricator.wikimedia.org/T387892)
[09:40:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to plain
[09:40:53] <wikibugs>	 (03PS5) 10Vgutierrez: acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020)
[09:40:59] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez)
[09:42:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to plain
[09:42:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Hotfixes release: [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1165835
[09:42:59] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.dns.netbox
[09:43:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Hotfixes release: [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1165835 (owner: 10Giuseppe Lavagetto)
[09:43:46] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes: api auth and bwlimit rules - oblivian@cumin1003"
[09:43:48] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes: api auth and bwlimit rules - oblivian@cumin1003
[09:43:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to plain
[09:44:03] <kostajh>	 jouncebot: nowandnext
[09:44:03] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0800)
[09:44:03] <jouncebot>	 In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1000)
[09:44:17] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes: api auth and bwlimit rules - oblivian@cumin1003
[09:44:18] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes: api auth and bwlimit rules - oblivian@cumin1003"
[09:44:23] <kostajh>	 jnuche: can I sync a patch to wmf.8?
[09:45:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to plain
[09:45:40] <wikibugs>	 (03PS1) 10Kosta Harlan: UserInfoCard: prevent default link behavior with "click" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165836 (https://phabricator.wikimedia.org/T398323)
[09:46:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to plain
[09:46:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to plain
[09:47:13] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002"
[09:47:16] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez)
[09:47:37] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002"
[09:47:37] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:47:38] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup2001.codfw.wmnet
[09:47:43] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "2nd pass lgtm!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[09:47:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to plain
[09:48:34] <kostajh>	 I assume it's OK, so I am proceeding
[09:49:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165836 (https://phabricator.wikimedia.org/T398323) (owner: 10Kosta Harlan)
[09:49:23] <vgutierrez>	 !log acme-chief: stop issuing RSA certificates by default - T398020
[09:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:25] <stashbot>	 T398020: Stop issuing RSA certificates - https://phabricator.wikimedia.org/T398020
[09:49:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to plain
[09:50:01] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:50:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to plain
[09:50:51] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage
[09:50:59] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:51:12] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Remove backup2001, old offsite backup host [puppet] - 10https://gerrit.wikimedia.org/r/1165823 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[09:51:27] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] eqiad/codfw: remove more former TE [homer/public] - 10https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi)
[09:51:47] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] eqiad/codfw: remove more former TE [homer/public] - 10https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi)
[09:51:59] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:52:18] <wikibugs>	 (03Merged) 10jenkins-bot: eqiad/codfw: remove more former TE [homer/public] - 10https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi)
[09:53:08] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6004.drmrs.wmnet with reason: reimage
[09:53:21] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398188#10966747 (10jcrespo)
[09:53:43] <wikibugs>	 (03PS3) 10Hashar: gerrit: config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693)
[09:53:44] <wikibugs>	 (03CR) 10Hashar: "There are a few MediaWiki libraries I'd like to move `mediawiki/libs` (T125031). That would simplify the CI configuration in the new Zuul." [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[09:53:45] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398188#10966752 (10jcrespo) This is ready. Reminder: it has 2 disk arrays attached.
[09:54:05] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:54:12] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage
[09:54:30] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397970)
[09:54:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6004.drmrs.wmnet with OS bookworm
[09:54:57] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10966769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6004.drmrs.wmnet with OS bookworm
[09:55:02] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts backup1001.eqiad.wmnet
[09:55:14] <wikibugs>	 (03PS2) 10Volans: debmonitor: fix debmonitor_servers hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696)
[09:56:59] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage
[09:58:32] <wikibugs>	 (03Merged) 10jenkins-bot: UserInfoCard: prevent default link behavior with "click" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165836 (https://phabricator.wikimedia.org/T398323) (owner: 10Kosta Harlan)
[09:58:32] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:43] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto)
[09:58:57] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1165836|UserInfoCard: prevent default link behavior with "click" (T398323)]]
[09:59:00] <stashbot>	 T398323: UserInfoCard: Browser jumps to the top of the page when opening card - https://phabricator.wikimedia.org/T398323
[09:59:54] <wikibugs>	 (03PS1) 10Volans: postinst: clear stale files [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1000)
[10:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:31] <wikibugs>	 (03CR) 10Volans: [C:03+2] debmonitor: fix debmonitor_servers hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:00:58] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.dns.netbox
[10:01:20] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1165836|UserInfoCard: prevent default link behavior with "click" (T398323)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:01:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Update account settings for aude [puppet] - 10https://gerrit.wikimedia.org/r/1165840
[10:02:14] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage
[10:03:07] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[10:04:36] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002"
[10:04:53] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002"
[10:04:53] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:04:54] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup1001.eqiad.wmnet
[10:07:29] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Remove backup1001 old backup director host from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1165822 (https://phabricator.wikimedia.org/T387892)
[10:08:12] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Remove backup1001 old backup director host from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1165822 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[10:08:50] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165836|UserInfoCard: prevent default link behavior with "click" (T398323)]] (duration: 09m 52s)
[10:08:52] <stashbot>	 T398323: UserInfoCard: Browser jumps to the top of the page when opening card - https://phabricator.wikimedia.org/T398323
[10:09:00] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::static: Put HAProxy in front of the Nginx instance [puppet] - 10https://gerrit.wikimedia.org/r/1165841 (https://phabricator.wikimedia.org/T397634)
[10:09:11] <kostajh>	 done deploying
[10:11:11] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission backup1001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398185#10966840 (10jcrespo)
[10:11:26] <wikibugs>	 (03PS2) 10Hnowlan: mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750)
[10:13:12] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003"
[10:13:46] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484)
[10:14:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update account settings for aude [puppet] - 10https://gerrit.wikimedia.org/r/1165840 (owner: 10Muehlenhoff)
[10:14:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6004.drmrs.wmnet with reason: host reimage
[10:14:31] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[10:15:53] <kostajh>	 jnuche: should we deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1165160 to fix the logspam for now?
[10:16:16] <logmsgbot>	 mvernon@cumin1003 reimage (PID 4138533) is awaiting input
[10:17:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6004.drmrs.wmnet with reason: host reimage
[10:17:41] <icinga-wm>	 PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor
[10:18:46] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003"
[10:18:46] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1093.eqiad.wmnet with OS bullseye
[10:18:49] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::static: Put HAProxy in front of the Nginx instance [puppet] - 10https://gerrit.wikimedia.org/r/1165841 (https://phabricator.wikimedia.org/T397634)
[10:18:49] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::static: Handle simple redirects in HAProxy config [puppet] - 10https://gerrit.wikimedia.org/r/1165843 (https://phabricator.wikimedia.org/T397634)
[10:18:55] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1093.eqiad.wmnet with OS bullseye complete...
[10:19:31] <icinga-wm>	 RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[10:19:39] <jnuche>	 kostajh: a fix for that would be awesome, it's the largest single type of error in the logs right now
[10:20:38] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484)
[10:21:00] <wikibugs>	 (03PS1) 10Zabe: maintain-views: Use linktarget and collation in categorylinks view [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951)
[10:21:05] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003"
[10:21:13] <wikibugs>	 (03Abandoned) 10Jforrester: FunctionEvaluator.vue: prod bug - js error for functions with Typed list as input param [extensions/WikiLambda] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163390 (https://phabricator.wikimedia.org/T397682) (owner: 10Jforrester)
[10:21:22] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003"
[10:21:23] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1092.eqiad.wmnet with OS bullseye
[10:21:32] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1092.eqiad.wmnet with OS bullseye complete...
[10:21:42] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[10:22:25] <wikibugs>	 (03CR) 10Jelto: [C:03+2] miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto)
[10:22:26] <wikibugs>	 (03PS1) 10Klausman: ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013)
[10:23:00] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) (owner: 10Klausman)
[10:24:04] <wikibugs>	 (03CR) 10Klausman: [C:03+2] ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) (owner: 10Klausman)
[10:24:20] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] api-gateway: use more recent ratelimit image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan)
[10:24:23] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto)
[10:25:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:42] <wikibugs>	 (03PS2) 10Zabe: maintain-views: Use linktarget and collation in categorylinks view [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951)
[10:26:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966891 (10MatthewVernon) @Jclark-ctr the problem with these two nodes was the same as we've had with every one of this batch of Dell servers - they arri...
[10:26:11] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) (owner: 10Klausman)
[10:26:21] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan)
[10:26:21] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966893 (10MatthewVernon)
[10:26:44] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[10:27:06] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[10:27:28] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[10:27:47] <wikibugs>	 (03PS1) 10Zabe: group1: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165846 (https://phabricator.wikimedia.org/T397912)
[10:27:51] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[10:28:00] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[10:28:10] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: use more recent ratelimit image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan)
[10:28:19] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[10:28:57] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[10:29:21] <wikibugs>	 (03PS1) 10Volans: kubernetes: fine-tune displayed name [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696)
[10:29:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:30:31] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[10:32:09] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:33:19] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:33:56] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2248.codfw.wmnet
[10:35:19] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:35:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:36:31] <kostajh>	 jnuche: we may make a different patch, later today 
[10:37:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10966992 (10Clement_Goubert) We don't particularly need the node in production as we have spare capacity, if you need them depooled for testing w...
[10:37:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10966999 (10Clement_Goubert) We don't particularly need the node in production as we have spare capacity, if you need them depooled for testing we...
[10:38:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mc-gp2004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1165849
[10:38:26] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165849 (owner: 10Muehlenhoff)
[10:38:35] <wikibugs>	 (03PS1) 10Elukey: admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850
[10:39:07] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2248.codfw.wmnet
[10:39:10] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2249.codfw.wmnet
[10:39:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6004.drmrs.wmnet with OS bookworm
[10:39:34] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967021 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6004.drmrs.wmnet with OS bookworm completed: - ganeti6004 (**PASS*...
[10:40:00] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 (owner: 10Elukey)
[10:40:19] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kubernetes: fine-tune displayed name [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[10:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:46] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[10:41:02] <jnuche>	 kostajh: ack, ty
[10:42:05] <wikibugs>	 (03PS2) 10Elukey: admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850
[10:42:32] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:42:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:42:54] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 (owner: 10Elukey)
[10:43:37] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[10:43:40] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:44:02] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:44:31] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2249.codfw.wmnet
[10:44:34] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2250.codfw.wmnet
[10:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:07] <wikibugs>	 (03PS1) 10MVernon: hiera: add ms-be109[2-5] to swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1165851 (https://phabricator.wikimedia.org/T391354)
[10:46:10] <wikibugs>	 (03PS1) 10MVernon: swift/eqiad: add ms-be109[2,3], drain ms-be1063 [puppet] - 10https://gerrit.wikimedia.org/r/1165852 (https://phabricator.wikimedia.org/T391354)
[10:46:18] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 137236
[10:47:01] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:47:08] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 137236
[10:47:17] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:47:46] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[10:47:57] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:48:07] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 37271
[10:48:51] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37271
[10:48:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:14] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[10:49:59] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2250.codfw.wmnet
[10:49:59] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[10:50:02] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2251.codfw.wmnet
[10:51:13] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] hiera: add ms-be109[2-5] to swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1165851 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[10:51:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet
[10:52:04] <wikibugs>	 (03CR) 10MVernon: [C:03+2] hiera: add ms-be109[2-5] to swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1165851 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[10:52:21] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] swift/eqiad: add ms-be109[2,3], drain ms-be1063 [puppet] - 10https://gerrit.wikimedia.org/r/1165852 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[10:52:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:53:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:55:19] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2251.codfw.wmnet
[10:55:22] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2252.codfw.wmnet
[10:59:17] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift/eqiad: add ms-be109[2,3], drain ms-be1063 [puppet] - 10https://gerrit.wikimedia.org/r/1165852 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[10:59:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet
[11:00:05] <jouncebot>	 mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1100). nyaa~
[11:00:28] <wikibugs>	 (03PS1) 10Tiziano Fogli: pontoon: add fw rules to allow titan to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/1165855
[11:00:50] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2252.codfw.wmnet
[11:00:54] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2253.codfw.wmnet
[11:03:06] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[11:04:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:35] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10967120 (10MatthewVernon)
[11:04:39] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2253.codfw.wmnet
[11:04:42] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2254.codfw.wmnet
[11:06:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:58] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2254.codfw.wmnet
[11:10:01] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2255.codfw.wmnet
[11:12:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6004.drmrs.wmnet to cluster drmrs02 and group B13
[11:14:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6004.drmrs.wmnet to cluster drmrs02 and group B13
[11:15:39] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2255.codfw.wmnet
[11:15:42] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2256.codfw.wmnet
[11:16:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to drbd
[11:16:32] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of netflow6001.drmrs.wmnet to drbd
[11:16:41] <icinga-wm>	 PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor
[11:17:33] <icinga-wm>	 RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[11:18:45] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[11:19:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to drbd
[11:19:44] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+1] "Not sure what's the appropriate way to merge this, backport?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm)
[11:20:58] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2256.codfw.wmnet
[11:21:02] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2257.codfw.wmnet
[11:23:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:26:30] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2257.codfw.wmnet
[11:26:33] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2258.codfw.wmnet
[11:28:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to drbd
[11:28:45] <icinga-wm>	 PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:39] <icinga-wm>	 RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 87.52 ms
[11:30:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] Switch mc-gp2004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1165849 (owner: 10Muehlenhoff)
[11:31:49] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2258.codfw.wmnet
[11:31:53] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2259.codfw.wmnet
[11:31:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:53] <jinxer-wm>	 FIRING: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[11:33:28] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet
[11:33:42] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:33:42] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 37271
[11:35:22] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 37271
[11:37:05] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2259.codfw.wmnet
[11:37:08] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2260.codfw.wmnet
[11:37:53] <jinxer-wm>	 RESOLVED: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[11:38:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to drbd
[11:40:03] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet
[11:40:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, though please move the setting to modules/pontoon/files/settings/titan.yaml since alerting_host can function without titan and the v" [puppet] - 10https://gerrit.wikimedia.org/r/1165855 (owner: 10Tiziano Fogli)
[11:41:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165486 (owner: 10Muehlenhoff)
[11:42:20] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2260.codfw.wmnet
[11:42:24] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2261.codfw.wmnet
[11:42:32] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[11:43:40] <wikibugs>	 (03CR) 10Jgiannelos: mobileapps: allow setting terminationGracePeriodSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[11:45:20] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10967282 (10Jclark-ctr) @MatthewVernon Thanks for assistance
[11:45:36] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10967283 (10Jclark-ctr) 05Open→03Resolved
[11:46:34] <wikibugs>	 (03CR) 10Hnowlan: mobileapps: allow setting terminationGracePeriodSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[11:46:39] <icinga-wm>	 PROBLEM - Memcached on mc2050 is CRITICAL: connect to address 10.192.32.82 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:47:31] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[11:47:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2261.codfw.wmnet
[11:47:40] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2262.codfw.wmnet
[11:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: memcached.service on mc2050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:57] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mobileapps: allow setting terminationGracePeriodSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[11:50:29] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "LGTM but I'm in a conference and will have a hard time running the maintain views maybe Francesco can?" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[11:51:23] <wikibugs>	 (03CR) 10FNegri: "Sure I'll do it!" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[11:51:32] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[11:52:39] <icinga-wm>	 RECOVERY - Memcached on mc2050 is OK: TCP OK - 0.030 second response time on 10.192.32.82 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[11:52:56] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2262.codfw.wmnet
[11:53:00] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[11:53:00] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2263.codfw.wmnet
[11:53:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: memcached.service on mc2050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:40] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[11:55:27] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:56:18] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[11:58:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2263.codfw.wmnet
[11:58:41] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2264.codfw.wmnet
[12:00:36] <icinga-wm>	 PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:01:26] <icinga-wm>	 RECOVERY - puppetboard.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:01:40] <icinga-wm>	 PROBLEM - Host prometheus6002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:03:39] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "Thank you! <3" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[12:04:08] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2264.codfw.wmnet
[12:04:11] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2265.codfw.wmnet
[12:04:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10967364 (10Jclark-ctr) if you can deploy them that would be great so there is some load @Clement_Goubert
[12:06:36] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:07:46] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[12:08:23] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:08:33] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:09:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:35] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2265.codfw.wmnet
[12:09:38] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2266.codfw.wmnet
[12:10:31] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[12:10:58] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[12:14:39] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[12:14:51] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2266.codfw.wmnet
[12:14:54] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2267.codfw.wmnet
[12:19:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:10] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2267.codfw.wmnet
[12:20:14] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2268.codfw.wmnet
[12:24:57] <wikibugs>	 (03PS2) 10Tiziano Fogli: pontoon: add fw rules to allow titan to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/1165855
[12:25:30] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2268.codfw.wmnet
[12:25:33] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2269.codfw.wmnet
[12:25:49] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] pontoon: add fw rules to allow titan to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/1165855 (owner: 10Tiziano Fogli)
[12:27:21] <godog>	 mmhh prometheus6002 is unhappy
[12:27:23] <godog>	 checking
[12:28:21] <godog>	 ah mmhh admin_down, moritzm: known?
[12:28:27] <godog>	 prometheus6002.drmrs.wmnet kvm        debootstrap+default ganeti6002.drmrs.wmnet ADMIN_down      - 
[12:29:02] <moritzm>	 the host is being switched to DRBD as part of the bookworm update of ganeti/drmrs
[12:29:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:29:25] <moritzm>	 drmrs has the unfortunate 2x2 design, so this is inevitable
[12:29:26] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "@zabe@avorwerk.net I'll let you merge first, then I'll run maintain-views after this is merged." [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[12:29:35] <moritzm>	 should be done in 10 mins approx
[12:29:53] <godog>	 ack thank you, I missed the drain-node invocation from earlier
[12:29:56] <godog>	 all good
[12:30:50] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2269.codfw.wmnet
[12:30:53] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2270.codfw.wmnet
[12:32:02] <moritzm>	 godog: drmrs is up for refresh in the next 12 months, then we'll create the new cluster with routed Ganeti, which doesn't have this issue
[12:32:14] <godog>	 moritzm: neat! looking forward to that
[12:36:19] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2270.codfw.wmnet
[12:36:23] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2271.codfw.wmnet
[12:41:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to drbd
[12:41:35] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2271.codfw.wmnet
[12:41:39] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2272.codfw.wmnet
[12:42:32] <icinga-wm>	 RECOVERY - Host prometheus6002 is UP: PING OK - Packet loss = 0%, RTA = 87.43 ms
[12:44:28] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Move Impact limit configuration to ext-GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599)
[12:44:30] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418)
[12:45:27] <jinxer-wm>	 RESOLVED: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:45:41] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "Right, I completely missed that. Thanks for catching it!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) (owner: 10Urbanecm)
[12:45:55] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] [Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) (owner: 10Urbanecm)
[12:46:37] <urbanecm>	 jouncebot: nowandnext
[12:46:38] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[12:46:38] <jouncebot>	 In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1300)
[12:47:01] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2272.codfw.wmnet
[12:47:04] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2273.codfw.wmnet
[12:47:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) (owner: 10Urbanecm)
[12:47:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) (owner: 10Urbanecm)
[12:48:26] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Move Impact limit configuration to ext-GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) (owner: 10Urbanecm)
[12:48:33] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) (owner: 10Urbanecm)
[12:48:57] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1165865|[Growth] Move Impact limit configuration to ext-GrowthExperiments (T341599)]], [[gerrit:1165866|[Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 (T398418 T341599)]]
[12:49:01] <stashbot>	 T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599
[12:49:02] <stashbot>	 T398418: TypeError: array_map(): Argument #2 ($array) must be of type array, int given - https://phabricator.wikimedia.org/T398418
[12:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[12:51:16] <logmsgbot>	 !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1165865|[Growth] Move Impact limit configuration to ext-GrowthExperiments (T341599)]], [[gerrit:1165866|[Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 (T398418 T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:51:52] <wikibugs>	 (03CR) 10Zabe: "I do not have +2 for puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[12:52:17] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2273.codfw.wmnet
[12:52:21] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2274.codfw.wmnet
[12:52:58] <logmsgbot>	 !log urbanecm@deploy1003 urbanecm: Continuing with sync
[12:55:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup1001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398185#10967592 (10Jclark-ctr) a:03Jclark-ctr
[12:55:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup1001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398185#10967594 (10Jclark-ctr) 05Open→03Resolved
[12:55:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to drbd
[12:56:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10967599 (10Jclark-ctr) @Stevemunene  is this on hold by anything else in Eqiad?
[12:57:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10967604 (10Jclark-ctr) a:03VRiley-WMF
[12:57:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2274.codfw.wmnet
[12:57:41] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2275.codfw.wmnet
[12:58:40] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165865|[Growth] Move Impact limit configuration to ext-GrowthExperiments (T341599)]], [[gerrit:1165866|[Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 (T398418 T341599)]] (duration: 09m 42s)
[12:58:43] <stashbot>	 T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599
[12:58:44] <stashbot>	 T398418: TypeError: array_map(): Argument #2 ($array) must be of type array, int given - https://phabricator.wikimedia.org/T398418
[12:58:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10967608 (10Jclark-ctr) @MoritzMuehlenhoff  is this still an issue could you verify again and we can try a different cable / brand cable if it is still slow?
[12:59:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10967611 (10Jclark-ctr) a:03Jclark-ctr
[13:00:04] <jouncebot>	 Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1300).
[13:00:04] <jouncebot>	 EggRoll97: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10967614 (10Jclark-ctr) @RobH  Can we close this task now that a decision has been made?
[13:02:19] <EggRoll97>	 o/
[13:02:34] <TheresNoTime>	 EggRoll97: o/ just reading the background on the patches a moment, given there's legal approval and the such :)
[13:02:48] <EggRoll97>	 All good, sorry I'm a couple minutes late, I was having issues with IRC
[13:02:53] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2275.codfw.wmnet
[13:02:56] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2276.codfw.wmnet
[13:03:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:04:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97)
[13:04:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164637 (https://phabricator.wikimedia.org/T398107) (owner: 10EggRoll97)
[13:04:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10967629 (10Jclark-ctr) Dell just responded they are going to send a new backplane for this device it will probably not arrive till Thursday /Satur...
[13:05:22] <wikibugs>	 (03Merged) 10jenkins-bot: Assign oathauth-verify-user to default bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97)
[13:05:24] <wikibugs>	 (03Merged) 10jenkins-bot: Add abusefilter-revert to sysops on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164637 (https://phabricator.wikimedia.org/T398107) (owner: 10EggRoll97)
[13:05:48] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1162158|Assign oathauth-verify-user to default bureaucrat (T265726)]], [[gerrit:1164637|Add abusefilter-revert to sysops on testwiki (T398107)]]
[13:05:52] <stashbot>	 T265726: Assign oathauth-verify-user to bureaucrats on WMF wikis - https://phabricator.wikimedia.org/T265726
[13:05:53] <stashbot>	 T398107: Enable abusefilter-revert on testwiki - https://phabricator.wikimedia.org/T398107
[13:06:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove now unused Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1165874
[13:07:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to drbd
[13:07:44] <icinga-wm>	 PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:07:48] <icinga-wm>	 RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 87.58 ms
[13:08:05] <logmsgbot>	 !log samtar@deploy1003 samtar, eggroll97: Backport for [[gerrit:1162158|Assign oathauth-verify-user to default bureaucrat (T265726)]], [[gerrit:1164637|Add abusefilter-revert to sysops on testwiki (T398107)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:08:11] <TheresNoTime>	 EggRoll97: those are both available to test on mwdebug
[13:08:12] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2276.codfw.wmnet
[13:08:16] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2277.codfw.wmnet
[13:08:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to drbd
[13:08:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:10:02] <wikibugs>	 (03PS1) 10Jgreen: Adjust codfw payments hostnames after deprecating LVS servers. [dns] - 10https://gerrit.wikimedia.org/r/1165877 (https://phabricator.wikimedia.org/T398321)
[13:10:36] <EggRoll97>	 TheresNoTime: seems fine from what I can see
[13:11:26] <wikibugs>	 (03CR) 10Jgreen: [V:03+1 C:03+2] Adjust codfw payments hostnames after deprecating LVS servers. [dns] - 10https://gerrit.wikimedia.org/r/1165877 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen)
[13:11:31] <logmsgbot>	 !log samtar@deploy1003 samtar, eggroll97: Continuing with sync
[13:11:58] <logmsgbot>	 !log jgreen@dns1004 START - running authdns-update
[13:12:06] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:13:06] <logmsgbot>	 !log jgreen@dns1004 END - running authdns-update
[13:13:27] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2277.codfw.wmnet
[13:13:31] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2278.codfw.wmnet
[13:15:32] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudceph: move per-host puppet7 def to role [puppet] - 10https://gerrit.wikimedia.org/r/1165587
[13:15:32] <wikibugs>	 (03PS1) 10Andrew Bogott: Include repo for ceph v16 'pacific' on  cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1165878 (https://phabricator.wikimedia.org/T306820)
[13:15:43] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott)
[13:15:47] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165878 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott)
[13:16:33] <wikibugs>	 (03CR) 10Klausman: [C:03+1] amd-pytorch21: delete torch 2.1.2 + ROCm 5.6 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1164329 (owner: 10Ilias Sarantopoulos)
[13:17:05] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1162158|Assign oathauth-verify-user to default bureaucrat (T265726)]], [[gerrit:1164637|Add abusefilter-revert to sysops on testwiki (T398107)]] (duration: 11m 16s)
[13:17:13] <stashbot>	 T265726: Assign oathauth-verify-user to bureaucrats on WMF wikis - https://phabricator.wikimedia.org/T265726
[13:17:13] <stashbot>	 T398107: Enable abusefilter-revert on testwiki - https://phabricator.wikimedia.org/T398107
[13:17:22] <TheresNoTime>	 EggRoll97: live on production :)
[13:17:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to drbd
[13:17:32] <EggRoll97>	 Yay, thanks TheresNoTime
[13:17:41] <icinga-wm>	 PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:17:51] <sukhe>	 ^ expected, moritzm is working
[13:18:01] <moritzm>	 !log installing rsyslog bugfix updates from Bookworm point release
[13:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to drbd
[13:18:36] <icinga-wm>	 RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 87.59 ms
[13:18:43] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2278.codfw.wmnet
[13:18:47] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2279.codfw.wmnet
[13:19:58] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:20:13] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org postfix mailing list - https://phabricator.wikimedia.org/T396062#10967738 (10Jgreen)
[13:20:58] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:21:23] <_joe_>	 !log depooling cp7006 for testing
[13:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:48] <wikibugs>	 (03PS3) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484)
[13:23:48] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch eqsin to the new upload cert [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484)
[13:24:06] <wikibugs>	 (03CR) 10Andrew Bogott: "pcc failed but only because my wildcard didn't work." [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott)
[13:24:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudceph: move per-host puppet7 def to role [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott)
[13:24:15] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2279.codfw.wmnet
[13:24:18] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2280.codfw.wmnet
[13:24:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Include repo for ceph v16 'pacific' on  cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1165878 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott)
[13:24:26] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[13:24:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:49] <zabe>	 TheresNoTime: are you done deploying?
[13:25:01] <TheresNoTime>	 zabe: yes sorry, forgot to say :)
[13:25:35] <zabe>	 no worries, just wanted to be sure
[13:25:43] <wikibugs>	 (03CR) 10Zabe: [C:03+2] group1: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165846 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[13:26:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to drbd
[13:26:24] <icinga-wm>	 PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:26:40] <icinga-wm>	 RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 87.55 ms
[13:26:53] <wikibugs>	 (03PS1) 10David Martin: wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208)
[13:27:04] <wikibugs>	 (03Merged) 10jenkins-bot: group1: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165846 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[13:27:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet
[13:27:32] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165846|group1: Set categorylinks to read new (T397912)]]
[13:27:34] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[13:27:43] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967771 (10ops-monitoring-bot) Draining ganeti6002.drmrs.wmnet of running VMs
[13:28:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet
[13:28:41] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:29:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:29:47] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2280.codfw.wmnet
[13:29:51] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2281.codfw.wmnet
[13:30:02] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1165846|group1: Set categorylinks to read new (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:30:06] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:30:30] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967775 (10MoritzMuehlenhoff)
[13:30:32] <moritzm>	 !log failover Ganeti master in drmrs02 to ganeti6004 T382513
[13:30:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:35] <stashbot>	 T382513: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513
[13:30:40] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:30:47] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442 (10Jgreen) 03NEW
[13:30:54] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[13:31:23] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Switch eqsin to the new upload cert [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[13:33:40] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:34:42] <urbanecm>	 zabe: if you're going to deploy something else too, can you take https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1152406 with you?
[13:34:45] <urbanecm>	 it's a docs-only patch
[13:35:19] <wikibugs>	 (03CR) 10Ssingh: "I think this is ready to ship IMO -- how have you tested this out so far? I want to give it a test run and then happy to +1!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[13:35:26] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2281.codfw.wmnet
[13:35:30] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2282.codfw.wmnet
[13:35:33] <zabe>	 sure
[13:37:01] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Switch eqsin to the new upload cert [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[13:37:40] <vgutierrez>	 !log switch upload@eqsin to the new upload cert - T394484
[13:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:07] <stashbot>	 T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484
[13:38:08] <zabe>	 why is the "left: " counter increasing ..
[13:38:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.702s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:39:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm)
[13:39:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:39:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0.8663% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:39:32] <claime>	 hmm
[13:39:34] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:44] <claime>	 that ain't good
[13:40:05] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] docs: Document why weighed tags cannot be updated via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm)
[13:40:24] <zabe>	 I aborted scap
[13:40:30] <zabe>	 Will try another sync-world
[13:40:37] <zabe>	 let us see how that goes
[13:40:52] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: T397912
[13:40:58] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2282.codfw.wmnet
[13:41:01] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2283.codfw.wmnet
[13:41:35] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Rename EventRegistration::$meetingAddress to $address for cache compat [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413)
[13:41:44] <stashbot>	 zabe@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
[13:41:44] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[13:42:22] <claime>	 "MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action." spike
[13:42:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:43:15] <urbanecm>	 claime: uhoh. i know what that is...
[13:43:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:43:26] <jynus>	 deployment-related?
[13:43:45] <vgutierrez>	 !incidents
[13:43:45] <sirenbot>	 6445 (UNACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[13:43:47] <jynus>	 I guess so, based on backlog
[13:43:49] <vgutierrez>	 !ack 6445
[13:43:49] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[13:43:57] <claime>	 actually that's not what's spiking
[13:44:01] <urbanecm>	 jynus: if UserNotLoggedIn is the cause, i'd say it's traffic related, but...i did not look at any logs
[13:44:14] <jynus>	 ok
[13:44:15] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:44:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:44:51] <claime>	 jobqueue issues Oo
[13:44:54] <ebernhardson>	 getting some errors, but it looks like people are already looking into it
[13:44:56] <logmsgbot>	 !log zabe@deploy1003 sync-world aborted: T397912 (duration: 04m 03s)
[13:45:02] <wikibugs>	 (03PS1) 10Zabe: Revert "group1: Set categorylinks to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165897
[13:45:07] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: activate new plugins packages - bking@cumin1002 - T397227
[13:45:08] <logmsgbot>	 !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: activate new plugins packages - bking@cumin1002 - T397227
[13:45:08] <claime>	 +query errors
[13:45:24] <zabe>	 It started showing up during deploying that one 
[13:45:34] <urbanecm>	 a lot of `Error: 2006 MySQL server has gone away` it seems
[13:45:37] <claime>	 Error: 2006 MySQL server has gone away
[13:45:39] <claime>	 yeah
[13:45:43] <wikibugs>	 (03CR) 10Zabe: [V:03+2 C:03+2] Revert "group1: Set categorylinks to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165897 (owner: 10Zabe)
[13:45:50] <claime>	 on commons
[13:46:07] <ebernhardson>	 i'm also getting this from api.php on enwiki: Original error: upstream connect error or disconnect/reset before headers. reset reason: connection failure
[13:46:08] <zabe>	 Although I do not really see the connection to jobqueue
[13:46:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to plain
[13:46:13] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2283.codfw.wmnet
[13:46:16] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2284.codfw.wmnet
[13:46:20] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new plugins packages - bking@cumin1002 - T397227
[13:46:22] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]]
[13:46:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to plain
[13:47:09] <claime>	 we'll wait and see if the revert roll out calms things down
[13:47:20] <urbanecm>	 (situations like this make me ask "is there a faster way to sync something than `wait 10 mins`")
[13:47:22] <effie>	 I think the convo should be moved to -sre, and anyone involved report what they know 
[13:47:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to plain
[13:48:00] <zabe>	 I think my patches caused some slow queries which overloaded commons db?
[13:48:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:48:41] <effie>	 zabe: #wikimedia-sre please :)
[13:48:42] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:48:49] <sukhe>	 moritzm: ^ should I downtime these?
[13:48:50] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:49:06] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:49:09] <claime>	 zabe: possible
[13:49:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:49:23] <jynus>	 350 million queries per second on commons
[13:49:27] <jynus>	 that cannot be handled
[13:49:42] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[13:49:46] <claime>	 jynus: -sre please
[13:49:51] <urbanecm>	 effie: otoh, https://wikitech.wikimedia.org/wiki/Backport_windows says deployment-related convo should happen in here...
[13:50:25] <logmsgbot>	 jmm@cumin2002 changedisk (PID 4005626) is awaiting input
[13:50:27] <moritzm>	 sukhe: each should resolve within 30seconds, so should be fine
[13:50:30] <sukhe>	 ok :)
[13:50:40] <moritzm>	 likewise for durum6002, which is incoming
[13:50:40] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:51:00] <zabe>	 ok, but I do not really see how my patch could increase the number of queries, only how it could make them slow
[13:51:14] <claime>	 urbanecm: yeah but it's impossible to follow with botnoise
[13:51:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to plain
[13:51:28] <claime>	 so sre debugging goes to -sre for the moment
[13:51:31] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2284.codfw.wmnet
[13:51:35] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2285.codfw.wmnet
[13:51:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[13:51:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:51:54] <vgutierrez>	 wut?
[13:51:56] <_joe_>	 that's my fault
[13:51:57] <vgutierrez>	 !incidents
[13:51:57] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[13:51:57] <sirenbot>	 6446 (UNACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[13:52:01] <_joe_>	 if it's magru
[13:52:02] <vgutierrez>	 !ack 6446
[13:52:03] <sirenbot>	 6446 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[13:52:11] <vgutierrez>	 _joe_: how?
[13:52:12] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:52:12] <_joe_>	 ah no if it's everything then it's not
[13:52:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to plain
[13:52:25] <_joe_>	 vgutierrez: I briefly repooled the server, for like 3 minutes
[13:52:31] <vgutierrez>	 oh :D
[13:52:40] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 10SRE Observability (FY2025/2026-Q1): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10967929 (10lmata)
[13:52:41] <_joe_>	 but tbh this seems to be related to the api issues
[13:52:43] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[13:52:48] <_joe_>	 yeeep
[13:52:53] <vgutierrez>	 !incidents
[13:52:53] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[13:52:53] <sirenbot>	 6446 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[13:52:54] <sirenbot>	 6447 (UNACKED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[13:52:57] <_joe_>	 I assume the oncall people are looking into it
[13:52:57] <vgutierrez>	 !ack 6447
[13:52:57] <sirenbot>	 6447 (ACKED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[13:52:58] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:53:13] <_joe_>	 ah that would be you vgutierrez, sorry, have fun
[13:53:44] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:53:55] <vgutierrez>	 _joe_: I'm here to coordinate it, not solve it :D
[13:54:08] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 6.212 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:54:12] <vgutierrez>	 claime: are you still waiting on the rollback?
[13:54:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to plain
[13:54:27] <claime>	 yeah
[13:54:29] <claime>	 it's ongoing
[13:54:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:54:34] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:54:38] <zabe>	 13:54:26 K8s deployment progress:  85% (ok: 1948; fail: 0; left: 321) \         
[13:54:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:54:58] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:55:00] <vgutierrez>	 !incidents
[13:55:00] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[13:55:00] <sirenbot>	 6446 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[13:55:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to plain
[13:55:00] <sirenbot>	 6447 (ACKED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[13:55:01] <sirenbot>	 6448 (UNACKED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[13:55:03] <vgutierrez>	 !ack 6448
[13:55:03] <sirenbot>	 6448 (ACKED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[13:55:04] <sukhe>	 !ack 6448
[13:55:04] <sirenbot>	 6448 (ACKED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[13:55:06] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:55:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to plain
[13:55:59] <wikibugs>	 (03PS1) 10Btullis: Bump resources and shared buffers config for postgresql-airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421)
[13:56:00] <wikibugs>	 (03PS1) 10Btullis: Increase max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686)
[13:56:04] <vgutierrez>	 5xx in ATS are already starting to decrease
[13:56:25] <Gommeh>	 is that a good thing?
[13:56:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[13:56:47] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2285.codfw.wmnet
[13:56:50] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2286.codfw.wmnet
[13:56:58] <vgutierrez>	 Gommeh: yes
[13:57:20] <wikibugs>	 (03PS1) 10Elukey: profile::thanos::swift: rework machinetranslation account [puppet] - 10https://gerrit.wikimedia.org/r/1165901 (https://phabricator.wikimedia.org/T335491)
[13:57:33] <vgutierrez>	 ATS is the cache layer that speaks to the applayer and it was recording an unexpected high number of 5xx from mw-api-ext-ro
[13:57:43] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[13:57:51] <wikibugs>	 (03Abandoned) 10Elukey: profile::thanos::swift: rework machinetranslation account [puppet] - 10https://gerrit.wikimedia.org/r/1165901 (https://phabricator.wikimedia.org/T335491) (owner: 10Elukey)
[13:58:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:58:17] <Gommeh>	 vgutierrez english please
[13:58:23] <Gommeh>	 new to this lol
[13:59:51] <jinxer-wm>	 FIRING: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:00:07] <vgutierrez>	 !incidents
[14:00:08] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[14:00:08] <sirenbot>	 6448 (ACKED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[14:00:08] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[14:00:08] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:00:08] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1400)
[14:00:26] <vgutierrez>	 claime: 5xx back again to ~3.5k rps
[14:00:43] <vgutierrez>	 everything on mw-api-ext-ro
[14:00:48] <_joe_>	 being okta'd during incident response: priceless
[14:00:57] <vgutierrez>	 Gommeh: ping me later after the incident ends :)
[14:00:57] <_joe_>	 yes there isn't one pod that's ready in eqiad
[14:01:16] <wikibugs>	 (03CR) 10Elukey: "Today I discovered that `swift post -r` can grant to multiple users the read ACLs, and the new config seems more inline with what we need " [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman)
[14:01:26] <_joe_>	 which makes me think it's not just zabe's patch
[14:01:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to plain
[14:01:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[14:01:46] <_joe_>	 but, to allow for systems to recover, should we ban all requests to the action api for commons?
[14:02:00] <vgutierrez>	 !incidents
[14:02:00] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[14:02:00] <sirenbot>	 6448 (ACKED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[14:02:01] <sirenbot>	 6449 (UNACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:02:01] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[14:02:01] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:02:02] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2286.codfw.wmnet
[14:02:04] <vgutierrez>	 !ack 6449
[14:02:04] <sirenbot>	 6449 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:02:06] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2287.codfw.wmnet
[14:02:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to plain
[14:03:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:03:29] <vgutierrez>	 mysql seems to be on fire since 13:30 in terms of rows read
[14:03:35] <wikibugs>	 (03PS2) 10Cory Massaro: wikifunctions: Enable batching in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156430
[14:03:39] <wikibugs>	 (03Abandoned) 10Jforrester: wikifunctions: Enable batching in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156430 (owner: 10Cory Massaro)
[14:04:34] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:04:51] <jinxer-wm>	 FIRING: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:04:57] <vgutierrez>	 !incidents
[14:04:57] <sirenbot>	 6445 (ACKED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[14:04:57] <sirenbot>	 6448 (ACKED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[14:04:57] <sirenbot>	 6449 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:04:58] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[14:04:58] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:05:03] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:05:44] <wikibugs>	 10SRE-SLO, 10observability, 10SRE Observability (FY2025/2026-Q1): Add a banner to slo.wikimedia.org explaining rolling vs calendar views - https://phabricator.wikimedia.org/T398313#10967982 (10lmata)
[14:06:19] <wikibugs>	 10SRE-SLO, 10observability, 10SRE Observability (FY2025/2026-Q1): Add links in the Pyrra rolling dashboards to point to their calendar ones in Grafana - https://phabricator.wikimedia.org/T398311#10967984 (10lmata)
[14:06:24] <wikibugs>	 (03CR) 10Urbanecm: "scap backport when no one is doing anything is an appropriate action in this case (since it is a labs-only change, it will amount to git p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm)
[14:06:38] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6002.drmrs.wmnet with reason: reimage
[14:06:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[14:07:22] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2287.codfw.wmnet
[14:07:26] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2288.codfw.wmnet
[14:07:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:23] <vgutierrez>	 ok... 5xx down to ~1k rps
[14:08:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:32] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS bookworm
[14:08:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6002.drmrs.wmnet with OS bookworm
[14:09:06] <wikibugs>	 (03PS1) 10David Martin: wikifunctions: Upgrade evaluator from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208)
[14:09:34] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:09:51] <jinxer-wm>	 FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:10:18] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10967996 (10MoritzMuehlenhoff)
[14:10:47] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:47] <wikibugs>	 (03PS5) 10Klausman: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491)
[14:11:56] <wikibugs>	 (03CR) 10Klausman: hiera/thanos-swift: Fix MinT user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman)
[14:12:13] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:12:41] <vgutierrez>	 ok.. ATSBackendErrorsHigh should recover any minute now
[14:12:42] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2288.codfw.wmnet
[14:12:43] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Ok I'll merge it as soon I get out of a meeting :)" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[14:12:46] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2289.codfw.wmnet
[14:13:59] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:14:00] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: retry revert
[14:14:15] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:14:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 1.911% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:14:21] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new plugins packages - bking@cumin1002 - T397227
[14:14:23] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[14:14:51] <jinxer-wm>	 RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:17:57] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2289.codfw.wmnet
[14:18:01] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2290.codfw.wmnet
[14:18:16] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.162s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:18:28] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: retry revert (duration: 04m 27s)
[14:18:28] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:45] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:20:47] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:22:13] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:23:18] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2290.codfw.wmnet
[14:23:21] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2291.codfw.wmnet
[14:24:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 22.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:25:03] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:26:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6002.drmrs.wmnet with reason: host reimage
[14:28:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10968092 (10MoritzMuehlenhoff)
[14:28:32] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:28:49] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2291.codfw.wmnet
[14:28:52] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2292.codfw.wmnet
[14:29:10] <wikibugs>	 06SRE: HTTP 503 errors trying to reach Wikipedia - https://phabricator.wikimedia.org/T398448#10968098 (10Aklapper)
[14:29:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 3.676% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:29:34] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:00] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:30:08] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1400)
[14:30:08] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1430)
[14:30:45] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:31:17] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[14:31:27] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet
[14:31:55] <logmsgbot>	 !log oblivian@deploy1003 Started scap sync-world: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]]
[14:32:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6002.drmrs.wmnet with reason: host reimage
[14:34:08] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2292.codfw.wmnet
[14:34:11] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2293.codfw.wmnet
[14:34:15] <logmsgbot>	 !log oblivian@deploy1003 zabe, oblivian: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:34:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 5.903% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:35:01] <logmsgbot>	 !log oblivian@deploy1003 zabe, oblivian: Continuing with sync
[14:35:10] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10968106 (MoritzMuehlenhoff)
[14:35:14] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:35:45] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:36:00] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:36:02] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:36:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:36:52] <jynus>	 ^ vgutierrez
[14:36:56] <jynus>	 probably this
[14:36:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:37:05] <vgutierrez>	 !incidents
[14:37:06] <sirenbot>	 6450 (UNACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[14:37:06] <sirenbot>	 6448 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[14:37:06] <sirenbot>	 6445 (RESOLVED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[14:37:07] <sirenbot>	 6449 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:37:07] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[14:37:07] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[14:37:12] <vgutierrez>	 !ack 6450
[14:37:13] <sirenbot>	 6450 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[14:37:33] <_joe_>	 vgutierrez: please don't ack alerts we're not managing rn. that's unrelated to the current issue
[14:37:53] <vgutierrez>	 taking a look at that at the moment
[14:38:17] <_joe_>	 I'd prefer your eyeballs on the main issue
[14:38:19] <_joe_>	 :)
[14:38:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:38:32] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:38:33] <godog>	 I'll take over looking at thanos
[14:38:37] <vgutierrez>	 godog: thx
[14:38:39] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227
[14:38:40] <logmsgbot>	 !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227
[14:38:42] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[14:38:42] <godog>	 you got it vgutierrez
[14:38:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:38:45] <logmsgbot>	 !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@1bb179b]: bump section topics to v1.6.0
[14:39:22] <logmsgbot>	 !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@1bb179b]: bump section topics to v1.6.0 (duration: 00m 47s)
[14:39:34] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:39:38] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2293.codfw.wmnet
[14:39:42] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2294.codfw.wmnet
[14:39:53] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10968153 (MoritzMuehlenhoff)
[14:40:22] <logmsgbot>	 !log oblivian@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] (duration: 08m 26s)
[14:41:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:42:32] <godog>	 !log bounce thanos-store on titan1002
[14:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:53] <wikibugs>	 (CR) Elukey: [C:+2] pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[14:44:58] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2294.codfw.wmnet
[14:45:02] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2295.codfw.wmnet
[14:45:03] <wikibugs>	 (CR) Elukey: [C:+2] profile::pyrra::filesystem::slo: fix WDQS SLI [puppet] - https://gerrit.wikimedia.org/r/1165521 (https://phabricator.wikimedia.org/T393966) (owner: Elukey)
[14:45:10] <wikibugs>	 (CR) Elukey: [C:+2] pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[14:45:19] <wikibugs>	 (PS2) Elukey: pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852)
[14:45:38] <wikibugs>	 (PS4) Elukey: pyrra: add experimental success ratio template for istio [puppet] - https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852)
[14:45:46] <wikibugs>	 (PS2) Elukey: pyrra: add tonecheck Pyrra config [puppet] - https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706)
[14:47:03] <wikibugs>	 (PS1) JMeybohm: sre.k8s.pool-depool-cluster: Exclude w[d,c]ws from repooling [cookbooks] - https://gerrit.wikimedia.org/r/1165908 (https://phabricator.wikimedia.org/T341984)
[14:47:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 8.186% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:47:36] <logmsgbot>	 !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=eqiad
[14:48:04] <wikibugs>	 (PS1) Elukey: profile::thanos::swift: rename machinetranslation account [labs/private] - https://gerrit.wikimedia.org/r/1165909
[14:48:28] <wikibugs>	 (CR) Elukey: "filed also https://gerrit.wikimedia.org/r/c/labs/private/+/1165909" [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[14:48:38] <wikibugs>	 (CR) Elukey: [C:+2] pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[14:49:48] <wikibugs>	 (CR) Elukey: [C:+2] pyrra: add experimental success ratio template for istio [puppet] - https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[14:50:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:50:29] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2295.codfw.wmnet
[14:50:32] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2296.codfw.wmnet
[14:50:45] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:51:00] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:52:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:52:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6002.drmrs.wmnet with OS bookworm
[14:52:33] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10968198 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6002.drmrs.wmnet with OS bookworm completed: - ganeti6002 (**PASS*...
[14:52:35] <wikibugs>	 SRE: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10968199 (Aklapper)
[14:52:38] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[14:52:54] <wikibugs>	 (PS1) Ahmon Dancy: data.yaml: Allow tailing of spiderpig jobrunner and apiserver journals [puppet] - https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945)
[14:53:21] <wikibugs>	 (PS2) Ahmon Dancy: data.yaml: Allow tailing of spiderpig jobrunner and apiserver journals [puppet] - https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945)
[14:53:28] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:54:17] <wikibugs>	 (CR) Aqu: [C:+1] "We tweaked it on analytics-test. My experience was globally positive with a faster enqueuing of tasks for dagruns with many tasks." [deployment-charts] - https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: Btullis)
[14:54:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet
[14:54:27] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[14:54:49] <wikibugs>	 (CR) Aqu: [C:+1] Bump resources and shared buffers config for postgresql-airflow-main [deployment-charts] - https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) (owner: Btullis)
[14:55:10] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:55:11] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[14:55:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 837.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:55:45] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:55:50] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2296.codfw.wmnet
[14:55:53] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2297.codfw.wmnet
[14:56:24] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2014
[14:57:02] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) (owner: Majavah)
[14:57:37] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2014
[14:57:41] <wikibugs>	 (PS1) Elukey: pyrra: rename class attribute for the citoid SLO [puppet] - https://gerrit.wikimedia.org/r/1165913
[15:00:09] <wikibugs>	 (CR) Majavah: [C:+2] natlog: Use a separate journald namespace with no storage [puppet] - https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) (owner: Majavah)
[15:00:30] <wikibugs>	 (PS2) Elukey: pyrra: rename class attribute for the citoid SLO [puppet] - https://gerrit.wikimedia.org/r/1165913
[15:00:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:01:21] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2297.codfw.wmnet
[15:01:22] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6126/console" [puppet] - https://gerrit.wikimedia.org/r/1165913 (owner: Elukey)
[15:01:24] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2298.codfw.wmnet
[15:02:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet
[15:02:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6002.drmrs.wmnet to cluster drmrs02 and group B13
[15:03:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:03:37] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10968238 (MoritzMuehlenhoff)
[15:03:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6002.drmrs.wmnet to cluster drmrs02 and group B13
[15:04:37] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1165606 (https://phabricator.wikimedia.org/T398245) (owner: Scott French)
[15:04:40] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6127/console" [puppet] - https://gerrit.wikimedia.org/r/1165913 (owner: Elukey)
[15:05:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to drbd
[15:06:26] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.185.0" for 2 host(s)
[15:06:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2298.codfw.wmnet
[15:06:40] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2299.codfw.wmnet
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:59] <logmsgbot>	 !log jiji@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=eqiad
[15:08:14] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.185.0" completed for 2 hosts
[15:11:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:11:51] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2299.codfw.wmnet
[15:11:55] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2300.codfw.wmnet
[15:12:28] <wikibugs>	 (CR) Joal: [C:+1] Increase max_tis_per_query for airflow-main [deployment-charts] - https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: Btullis)
[15:13:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:13:59] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:14:05] <wikibugs>	 (CR) Joal: [C:+1] "Not knowing defaults I don't know by how much we grow the available resources, but I think growing them is positive! +1" [deployment-charts] - https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) (owner: Btullis)
[15:14:34] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host pc2014
[15:14:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2014
[15:14:44] <wikibugs>	 (PS1) Majavah: hieradata: Enable NAT logging on both codfw1dev cloudgws [puppet] - https://gerrit.wikimedia.org/r/1165919 (https://phabricator.wikimedia.org/T273734)
[15:15:09] <vgutierrez>	 !log repool cp7006
[15:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to drbd
[15:15:45] <icinga-wm>	 PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:12] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Enable NAT logging on both codfw1dev cloudgws [puppet] - https://gerrit.wikimedia.org/r/1165919 (https://phabricator.wikimedia.org/T273734) (owner: Majavah)
[15:16:37] <icinga-wm>	 RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 87.53 ms
[15:16:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:17] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2300.codfw.wmnet
[15:17:20] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2301.codfw.wmnet
[15:18:45] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[15:20:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to drbd
[15:21:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:22:08] <wikibugs>	 (PS1) Majavah: natlog: Add explicit dependency to file_line [puppet] - https://gerrit.wikimedia.org/r/1165921 (https://phabricator.wikimedia.org/T273734)
[15:22:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2301.codfw.wmnet
[15:22:41] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2302.codfw.wmnet
[15:26:11] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] pyrra: rename class attribute for the citoid SLO [puppet] - https://gerrit.wikimedia.org/r/1165913 (owner: Elukey)
[15:26:57] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:27:44] <wikibugs>	 (CR) Paladox: gerrit: config replicas for rename-project plugin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[15:28:04] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2302.codfw.wmnet
[15:28:07] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2303.codfw.wmnet
[15:30:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to drbd
[15:30:47] <icinga-wm>	 PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:31:03] <icinga-wm>	 RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 87.48 ms
[15:31:57] <jinxer-wm>	 RESOLVED: [5x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:23] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2303.codfw.wmnet
[15:33:27] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2304.codfw.wmnet
[15:38:42] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2304.codfw.wmnet
[15:38:46] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2305.codfw.wmnet
[15:41:09] <jnuche>	 jouncebot: nowandnext
[15:41:09] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 18 minute(s)
[15:41:09] <jouncebot>	 In 1 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1700)
[15:42:42] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2305.codfw.wmnet
[15:42:46] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2306.codfw.wmnet
[15:44:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) (owner: Daimona Eaytoy)
[15:45:39] <wikibugs>	 (CR) Cmelo: [C:+1] Rename EventRegistration::$meetingAddress to $address for cache compat [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) (owner: Daimona Eaytoy)
[15:45:55] <wikibugs>	 (CR) FNegri: [C:+2] maintain-views: Use linktarget and collation in categorylinks view [puppet] - https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: Zabe)
[15:46:17] <wikibugs>	 (Merged) jenkins-bot: Rename EventRegistration::$meetingAddress to $address for cache compat [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) (owner: Daimona Eaytoy)
[15:46:46] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1165894|Rename EventRegistration::$meetingAddress to $address for cache compat (T398413)]]
[15:46:48] <stashbot>	 T398413: TypeError: Cannot assign string to property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$meetingAddress of type ?MediaWiki\Extension\CampaignEvents\Address\Address - https://phabricator.wikimedia.org/T398413
[15:47:58] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs4010.ulsfo.wmnet with reason: katran migration
[15:48:10] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[15:48:10] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2306.codfw.wmnet
[15:48:14] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2307.codfw.wmnet
[15:48:36] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Switch lvs4010 to katran [puppet] - https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[15:49:06] <logmsgbot>	 !log jnuche@deploy1003 jnuche, daimona: Backport for [[gerrit:1165894|Rename EventRegistration::$meetingAddress to $address for cache compat (T398413)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:49:56] <logmsgbot>	 !log jnuche@deploy1003 jnuche, daimona: Continuing with sync
[15:51:42] <wikibugs>	 (PS2) Dzahn: remove legacy miscweb VM service names [dns] - https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080)
[15:52:04] <wikibugs>	 (CR) Dzahn: [C:+1] "the decom cookbook has been executed on these" [dns] - https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080) (owner: Dzahn)
[15:53:47] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2307.codfw.wmnet
[15:53:50] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2308.codfw.wmnet
[15:55:37] <logmsgbot>	 !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165894|Rename EventRegistration::$meetingAddress to $address for cache compat (T398413)]] (duration: 08m 51s)
[15:55:39] <stashbot>	 T398413: TypeError: Cannot assign string to property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$meetingAddress of type ?MediaWiki\Extension\CampaignEvents\Address\Address - https://phabricator.wikimedia.org/T398413
[15:55:44] <wikibugs>	 (Abandoned) Elukey: aux/dse: remove the usage of sha256 digest image tags [deployment-charts] - https://gerrit.wikimedia.org/r/1163713 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[15:55:57] <wikibugs>	 (Abandoned) Elukey: profile::thanos::swift: rename machinetranslation account [labs/private] - https://gerrit.wikimedia.org/r/1165909 (owner: Elukey)
[15:56:08] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2308.codfw.wmnet
[15:56:11] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2309.codfw.wmnet
[15:56:37] <vgutierrez>	 !log switch lvs4010 to katran - 10.128.0.11
[15:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:45] <vgutierrez>	 wrong copy&pasta :)
[15:59:05] <wikibugs>	 (PS1) Vgutierrez: hiera: Consolidate ulsfo liberica fp settings [puppet] - https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561)
[15:59:36] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[16:01:23] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2309.codfw.wmnet
[16:01:26] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2310.codfw.wmnet
[16:01:50] <wikibugs>	 (CR) Andrea Denisse: [C:+2] centrallog: Log with standard and custom template [puppet] - https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[16:02:41] <wikibugs>	 (PS28) Elukey: WIP: add support for kubernetes [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1163826
[16:05:45] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: Consolidate ulsfo liberica fp settings [puppet] - https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[16:06:33] <wikibugs>	 (CR) Btullis: [C:+2] Bump resources and shared buffers config for postgresql-airflow-main [deployment-charts] - https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) (owner: Btullis)
[16:06:37] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2310.codfw.wmnet
[16:06:41] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2311.codfw.wmnet
[16:08:09] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[16:08:21] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott)
[16:08:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Consolidate ulsfo liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[16:10:00] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[16:10:04] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[16:12:02] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2311.codfw.wmnet
[16:12:06] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2312.codfw.wmnet
[16:13:14] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[16:13:54] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] sre.k8s.pool-depool-cluster: Exclude w[d,c]ws from repooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1165908 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[16:17:23] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2312.codfw.wmnet
[16:17:27] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2313.codfw.wmnet
[16:21:41] <wikibugs>	 (03PS1) 10Clément Goubert: admin::data: Add access for antonkokhwmde [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917)
[16:21:41] <wikibugs>	 (03CR) 10Clément Goubert: "Verified out of band via slack" [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) (owner: 10Clément Goubert)
[16:22:18] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10968656 (10Clement_Goubert)
[16:22:43] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2313.codfw.wmnet
[16:22:47] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2314.codfw.wmnet
[16:27:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack web proxy: allow 'proxyadmin' users to modify proxies [puppet] - 10https://gerrit.wikimedia.org/r/1165154 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott)
[16:27:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott)
[16:28:16] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2314.codfw.wmnet
[16:28:20] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2315.codfw.wmnet
[16:29:20] <wikibugs>	 (03PS2) 10Clément Goubert: admin::data: Add access for antonkokhwmde [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917)
[16:30:22] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Verified uid and access requirement." [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) (owner: 10Clément Goubert)
[16:30:38] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] admin::data: Add access for antonkokhwmde [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) (owner: 10Clément Goubert)
[16:33:32] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2315.codfw.wmnet
[16:33:35] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2316.codfw.wmnet
[16:34:03] <wikibugs>	 06SRE, 13Patch-For-Review: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10968700 (10Clement_Goubert) For the record, this is this incident https://www.wikimediastatus.net/incidents/57jsxtn7hlvf
[16:34:43] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Increase max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: 10Btullis)
[16:35:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464 (10cmooney) 03NEW p:05Triage→03Low
[16:36:21] <wikibugs>	 (03Merged) 10jenkins-bot: Increase max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: 10Btullis)
[16:37:22] <wikibugs>	 06SRE, 13Patch-For-Review: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10968724 (10Clement_Goubert) p:05Triage→03Medium Incident is resolved, setting medium priority for follow-up.
[16:39:03] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2316.codfw.wmnet
[16:39:07] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2317.codfw.wmnet
[16:39:40] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[16:40:10] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[16:40:42] <wikibugs>	 (03PS1) 10Andrew Bogott: Prepare cloudcephmon nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165940 (https://phabricator.wikimedia.org/T306820)
[16:40:44] <wikibugs>	 (03PS1) 10Andrew Bogott: Prepare cloudcephosd nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165941 (https://phabricator.wikimedia.org/T306820)
[16:43:21] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2317.codfw.wmnet
[16:43:24] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2318.codfw.wmnet
[16:43:30] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:44:03] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165941 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott)
[16:44:07] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165940 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott)
[16:44:26] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:45:30] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10968745 (10Clement_Goubert) Shell access and kerberos principal created, i...
[16:45:57] <wikibugs>	 (03PS2) 10Volans: kubernetes: improve naming [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696)
[16:46:20] <wikibugs>	 (03CR) 10Volans: "following today's IRC discussion this is the final proposal with proper naming." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[16:47:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephmon nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165940 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott)
[16:47:52] <inflatador>	 !log bking@cumin1002 restarting cirrussearch codfw T397227
[16:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:55] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[16:48:28] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: opensearch_1@production-search-omega-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:48:36] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2318.codfw.wmnet
[16:48:39] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2319.codfw.wmnet
[16:48:43] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cirrussearch2099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f835f0dd1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec
[16:48:43] <icinga-wm>	 dia.org/wiki/Search%23Administration
[16:50:43] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2099 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: green, timed_out: False, number_of_nodes: 27, number_of_data_nodes: 27, discovered_master: True, active_primary_shards: 1710, active_shards: 5125, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number
[16:50:43] <icinga-wm>	 ing_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[16:53:40] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2319.codfw.wmnet
[16:53:43] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2320.codfw.wmnet
[16:58:55] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2320.codfw.wmnet
[16:58:59] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2321.codfw.wmnet
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1700)
[17:04:31] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2321.codfw.wmnet
[17:04:35] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2322.codfw.wmnet
[17:10:03] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2322.codfw.wmnet
[17:10:06] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2323.codfw.wmnet
[17:12:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:15:34] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2323.codfw.wmnet
[17:15:37] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2324.codfw.wmnet
[17:18:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:21:03] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2324.codfw.wmnet
[17:21:06] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2325.codfw.wmnet
[17:23:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:26:23] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2325.codfw.wmnet
[17:26:26] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2326.codfw.wmnet
[17:27:43] <wikibugs>	 (03CR) 10Scott French: [C:03+2] aptrepo: add pcre2-php83-bullseye to Update list [puppet] - 10https://gerrit.wikimedia.org/r/1165606 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French)
[17:28:07] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] remove legacy miscweb VM service names [dns] - 10https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[17:28:47] <logmsgbot>	 !log dzahn@dns1004 START - running authdns-update
[17:29:57] <logmsgbot>	 !log dzahn@dns1004 END - running authdns-update
[17:31:42] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2326.codfw.wmnet
[17:31:46] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2327.codfw.wmnet
[17:32:49] <wikibugs>	 (03PS1) 10Dzahn: miscweb: delete role and miscweb::httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/1165955 (https://phabricator.wikimedia.org/T397080)
[17:34:03] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2327.codfw.wmnet
[17:34:06] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2328.codfw.wmnet
[17:36:28] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2328.codfw.wmnet
[17:36:31] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2329.codfw.wmnet
[17:40:25] <icinga-wm>	 PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[17:41:44] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2329.codfw.wmnet
[17:41:48] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2330.codfw.wmnet
[17:42:25] <icinga-wm>	 RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.011 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[17:47:00] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2330.codfw.wmnet
[17:52:41] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] miscweb: delete role and miscweb::httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/1165955 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[17:53:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] miscweb: delete role and miscweb::httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/1165955 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[17:53:45] <swfrench-wmf>	 !log reprepro update component/php83 with pcre2 10.42-1~wmf11+1 - T398245
[17:53:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:48] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[18:00:05] <jouncebot>	 jnuche and jeena: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1800).
[18:05:05] <wikibugs>	 (03CR) 10Ssingh: "Output for review:" [puppet] - 10https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[18:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:11:25] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:11:27] <icinga-wm>	 PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:12:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:18:47] <wikibugs>	 06SRE, 10MW-1.45-notes (1.45.0-wmf.9; 2025-07-08), 07Wikimedia-Incident: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10969186 (10Aklapper)
[18:20:29] <wikibugs>	 (03PS1) 10Samtar: labstore: Add dumpstorrents project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/1165962 (https://phabricator.wikimedia.org/T398477)
[18:29:25] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:29:27] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:29:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165941 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott)
[18:30:27] <wikibugs>	 (03PS2) 10Majavah: cloudnfs: Add dumpstorrents project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/1165962 (https://phabricator.wikimedia.org/T398477) (owner: 10Samtar)
[18:31:11] <wikibugs>	 (03CR) 10Majavah: [C:03+2] cloudnfs: Add dumpstorrents project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/1165962 (https://phabricator.wikimedia.org/T398477) (owner: 10Samtar)
[18:32:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:42:15] <swfrench-wmf>	 !log reprepro include php8.3_8.3.22-1+wmf11u1 in component/php83  - T398245
[18:42:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:18] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[19:10:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:11:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#10969492 (10Jhancock.wm)
[19:11:23] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:18:09] <icinga-wm>	 PROBLEM - Host ssw1-d8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:27] <icinga-wm>	 PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:27] <icinga-wm>	 PROBLEM - Host lsw1-d8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:09] <icinga-wm>	 PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:15] <icinga-wm>	 PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[19:29:17] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:38:28] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:39:17] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:43:47] <wikibugs>	 (03CR) 10Hashar: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[19:47:35] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:04:13] <Krinkle>	 is anyone deploying?
[20:04:23] <wikibugs>	 (03PS2) 10Krinkle: beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318)
[20:04:28] <wikibugs>	 (03PS4) 10Krinkle: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318)
[20:04:32] <wikibugs>	 (03PS3) 10Krinkle: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318)
[20:06:32] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[20:06:52] <Krinkle>	 !log krinkle@deploy1003:/srv/mediawiki$ git remote rm gerrit -- Fix `jforrester@gerrit.wikimedia.org: Permission denied (publickey).` There were two remotes: $ git remote -v gerrit  ssh://jforrester@gerrit origin  ssh://gerrit.wikimedia.org:29418
[20:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:37] <wikibugs>	 (03CR) 10Krinkle: [C:03+2] beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:09:29] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:10:12] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[20:11:18] <wikibugs>	 (03PS5) 10Krinkle: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318)
[20:11:46] <wikibugs>	 (03CR) 10Krinkle: [C:03+2] beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:11:50] <wikibugs>	 (03PS4) 10Krinkle: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318)
[20:12:16] <wikibugs>	 (03CR) 10Krinkle: [C:03+2] multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:12:35] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:13:14] <wikibugs>	 (03Merged) 10jenkins-bot: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:28:36] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Beta patches Iff58893f, I62b31535, I228d7766a57
[20:29:04] <swfrench-wmf>	 !log reprepro include php-defaults_94+wmf11u1 in component/php83 - T398245
[20:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:06] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[20:29:24] <wikibugs>	 (03CR) 10FNegri: [C:03+2] "After merging I realized this hasn't been +1d from the Data Engineering team, and they are the owners of maintain-views.yaml. [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe)
[20:30:40] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm
[20:31:42] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Beta patches Iff58893f, I62b31535, I228d7766a57 (duration: 03m 06s)
[20:32:30] <wikibugs>	 (03PS1) 10Krinkle: missing.php: Support beta suffix for auth.wikimedia error page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318)
[20:33:46] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:52] <wikibugs>	 (03CR) 10Krinkle: missing.php: Support beta suffix for auth.wikimedia error page (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:34:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:34:12] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:34:18] <swfrench-wmf>	 !log reprepro include dh-php_5.5+wmf11u1 in component/php83 - T398245
[20:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:21] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[20:34:51] <wikibugs>	 (03Merged) 10jenkins-bot: missing.php: Support beta suffix for auth.wikimedia error page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[20:35:16] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1165983|missing.php: Support beta suffix for auth.wikimedia error page (T289318)]]
[20:35:18] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[20:36:08] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 6.474 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:36:46] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.566 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:25] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1165983|missing.php: Support beta suffix for auth.wikimedia error page (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:40:45] <wikibugs>	 (03CR) 10D3r1ck01: [C:03+1] "Made: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1165984" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154108 (owner: 10Krinkle)
[20:42:28] <wikibugs>	 (03PS1) 10Cwhite: logstash: pass through normalized arrays from filter-on-template v1 [puppet] - 10https://gerrit.wikimedia.org/r/1165988 (https://phabricator.wikimedia.org/T234565)
[20:46:05] <wikibugs>	 (03PS1) 10Krinkle: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318)
[20:47:22] <wikibugs>	 (03PS2) 10Krinkle: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318)
[20:49:22] <wikibugs>	 (03CR) 10Bking: [C:03+1] hiera,cirrus: Enable IPIP on search*@codfw services [puppet] - 10https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez)
[20:49:58] <wikibugs>	 (03CR) 10Bking: [C:03+1] hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - 10https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: 10Vgutierrez)
[20:50:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1206:9290 - https://phabricator.wikimedia.org/T397978#10969894 (10Jclark-ctr) 05Open→03Resolved Received replacement psu server has dual power
[20:51:06] <jinxer-wm>	 FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[20:51:38] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: pass through normalized arrays from filter-on-template v1 [puppet] - 10https://gerrit.wikimedia.org/r/1165988 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[20:53:04] <wikibugs>	 (03PS2) 10Andrew Bogott: Openstack common/servicetoken.erb: remove a misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/1143611
[20:55:04] <wikibugs>	 (03PS2) 10Krinkle: wmf-config: Fix filename typo in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154108
[20:59:24] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[21:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2100)
[21:04:25] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) (owner: 10David Martin)
[21:04:32] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) (owner: 10David Martin)
[21:05:10] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165983|missing.php: Support beta suffix for auth.wikimedia error page (T289318)]] (duration: 29m 54s)
[21:05:14] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[21:08:54] <wikibugs>	 (03CR) 10David Martin: [C:03+2] wikifunctions: Upgrade evaluators from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) (owner: 10David Martin)
[21:08:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:10:45] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) (owner: 10David Martin)
[21:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:30] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:12:37] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:12:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:13:20] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:14:05] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:15:58] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:16:54] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:17:42] <wikibugs>	 10ops-codfw, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2320:9290 - https://phabricator.wikimedia.org/T398514 (10phaultfinder) 03NEW
[21:18:00] <wikibugs>	 (03CR) 10David Martin: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) (owner: 10David Martin)
[21:19:35] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) (owner: 10David Martin)
[21:20:34] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:20:58] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:22:22] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:22:58] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:23:18] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:23:47] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:24:29] <wikibugs>	 (03PS1) 10Zabe: ApiQueryCategoryMembers: Use correct index for categorylinks [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165996 (https://phabricator.wikimedia.org/T385890)
[21:32:54] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161661 (owner: 10PipelineBot)
[21:32:57] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1164168 (owner: 10PipelineBot)
[21:33:02] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165065 (owner: 10PipelineBot)
[21:33:05] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165150 (owner: 10PipelineBot)
[21:33:22] <wikibugs>	 (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162238 (owner: 10PipelineBot)
[21:33:41] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165496 (owner: 10PipelineBot)
[21:35:52] <wikibugs>	 (03PS1) 10Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999
[21:36:30] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 302661744 and 25 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:36:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999 (owner: 10Krinkle)
[21:37:30] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 59624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:42:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154108 (owner: 10Krinkle)
[21:42:48] <wikibugs>	 (03PS2) 10Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999
[21:43:10] <wikibugs>	 (03Merged) 10jenkins-bot: wmf-config: Fix filename typo in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154108 (owner: 10Krinkle)
[21:49:36] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:55:51] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:59:11] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2200)
[22:02:21] <logmsgbot>	 jhathaway@cumin2002 provision (PID 4172609) is awaiting input
[22:02:30] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[22:04:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10970263 (10VRiley-WMF) We have received the Seed Server for this unit. Would we like to use a new/different name but set it up in the same location?
[22:07:53] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:08:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:08:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:11:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:43] <zabe>	 jouncebot: nowandnext
[22:12:43] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2200)
[22:12:43] <jouncebot>	 In 7 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600)
[22:12:43] <jouncebot>	 In 7 hour(s) and 47 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600)
[22:12:46] <wikibugs>	 (03PS5) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:12:50] <wikibugs>	 (03CR) 10Zabe: [C:03+2] ApiQueryCategoryMembers: Use correct index for categorylinks [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165996 (https://phabricator.wikimedia.org/T385890) (owner: 10Zabe)
[22:12:51] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:14:22] <wikibugs>	 (03PS6) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:14:59] <wikibugs>	 (03CR) 10Dzahn: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[22:16:04] <wikibugs>	 (03CR) 10ArielGlenn: [C:03+1] "Seems fine to me. If you can't find someone else to merge it, I'll be happy to." [puppet] - 10https://gerrit.wikimedia.org/r/1165526 (owner: 10D3r1ck01)
[22:16:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[22:16:48] <wikibugs>	 (03Merged) 10jenkins-bot: ApiQueryCategoryMembers: Use correct index for categorylinks [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165996 (https://phabricator.wikimedia.org/T385890) (owner: 10Zabe)
[22:17:33] <wikibugs>	 (03PS3) 10Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318)
[22:17:47] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165996|ApiQueryCategoryMembers: Use correct index for categorylinks (T385890 T398448)]]
[22:17:51] <stashbot>	 T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[22:17:52] <stashbot>	 T398448: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448
[22:18:40] <wikibugs>	 (03PS7) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:19:16] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:19:30] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:19:55] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1165996|ApiQueryCategoryMembers: Use correct index for categorylinks (T385890 T398448)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:19:59] <wikibugs>	 (03PS4) 10Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318)
[22:20:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm
[22:20:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970299 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm
[22:20:55] <wikibugs>	 (03CR) 10Krinkle: [C:03+2] beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[22:21:11] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:21:35] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[22:21:37] <wikibugs>	 (03CR) 10Krinkle: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[22:25:30] <wikibugs>	 (03PS8) 10Cwhite: logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:26:59] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165996|ApiQueryCategoryMembers: Use correct index for categorylinks (T385890 T398448)]] (duration: 09m 12s)
[22:27:03] <stashbot>	 T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[22:27:04] <stashbot>	 T398448: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448
[22:27:23] <zabe>	 Krinkle: you can merge your patch now
[22:27:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970307 (10VRiley-WMF)
[22:29:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:29:51] <Krinkle>	 zabe: it's ok, I'll roll it out later. I've got a few errands to run first.
[22:30:00] <Krinkle>	 thx for the ping
[22:30:03] <zabe>	 alright
[22:33:41] <wikibugs>	 (03PS1) 10BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[22:36:28] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 85041MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[22:37:18] <sukhe>	 ryankemper: see the above page. wdqs2009 is acting up. I see a blazegraph restart in SAL
[22:37:30] <sukhe>	 is that the recommended course of action?
[22:37:43] <sukhe>	 !incidents
[22:37:43] <sirenbot>	 6451 (UNACKED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[22:37:43] <sirenbot>	 6450 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[22:37:44] <sirenbot>	 6448 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[22:37:44] <sirenbot>	 6445 (RESOLVED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[22:37:44] <sirenbot>	 6449 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[22:37:44] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[22:37:44] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[22:37:50] <sukhe>	 !ack 6451
[22:37:51] <sirenbot>	 6451 (ACKED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[22:38:02] <ryankemper>	 sukhe: yes it is
[22:38:04] <sukhe>	 inflatador: see above as well
[22:38:09] <ryankemper>	 let me look into making the service not page
[22:38:31] <sukhe>	 ryankemper: ok thanks. can you take care of it please? not really near a computer rn
[22:38:39] <sukhe>	 I acked the page 
[22:38:50] <ryankemper>	 yeah I've got it
[22:38:56] <swfrench-wmf>	 ah, I was just flagging in -sre, whoops
[22:39:17] <sukhe>	 ryankemper: <3
[22:39:26] <sukhe>	 thanks swfrench-wmf 
[22:39:37] <wikibugs>	 (03PS1) 10Cwhite: add docs for string_to_numeric_conversion_failure [software/ecs] - 10https://gerrit.wikimedia.org/r/1166008 (https://phabricator.wikimedia.org/T234565)
[22:40:00] <ryankemper>	 !log [WDQS] Restart wdqs-blazegraph on wdqs2009
[22:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:49] <ryankemper>	 So is this alert a generic one that will apply regardless of `page: false` being set in service.yaml?
[22:41:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970332 (10VRiley-WMF) While trying to image these servers, it seems to lock up during the reboot with just a generic time out reason. Verified that the s...
[22:41:28] <ryankemper>	 Because ideally I don't want this host paging. It's a single wdqs full graph host that will be kept online for the next few months for legacy reasons, but we don't make any guarantees to users as to its availability
[22:43:28] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: convert ECS numeric fields from strings [puppet] - 10https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[22:44:23] <wikibugs>	 (03PS1) 10JHathaway: preseed: fix match for sretest [puppet] - 10https://gerrit.wikimedia.org/r/1166010
[22:44:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:46:03] <sukhe>	 ryankemper: yeah pretty much. this is alerting because ATS is not happy with the backend
[22:47:25] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] preseed: fix match for sretest [puppet] - 10https://gerrit.wikimedia.org/r/1166010 (owner: 10JHathaway)
[22:47:37] <sukhe>	  https://github.com/wikimedia/operations-alerts/blob/master/team-sre/cdn.yaml
[22:49:37] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:49:43] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.186.0" for 2 host(s)
[22:50:40] <wikibugs>	 (03PS1) 10Arlolra: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798)
[22:51:31] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.186.0" completed for 2 hosts
[22:51:53] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:01:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2009:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:02:08] <ryankemper>	 !log [WDQS] `ryankemper@wdqs2009:~$ sudo systemctl restart prometheus-blazegraph-exporter-wdqs-blazegraph.service`
[23:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:29] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[23:05:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:05:52] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:07:28] <swfrench-wmf>	 !incidents
[23:07:28] <sirenbot>	 6452 (UNACKED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[23:07:29] <sirenbot>	 6451 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[23:07:29] <sirenbot>	 6450 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[23:07:29] <sirenbot>	 6448 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[23:07:29] <sirenbot>	 6445 (RESOLVED)  ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[23:07:30] <sirenbot>	 6449 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[23:07:30] <sirenbot>	 6447 (RESOLVED)  VarnishUnavailable global sre (varnish-text thanos-rule)
[23:07:30] <sirenbot>	 6446 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[23:07:36] <swfrench-wmf>	 !ack 6452
[23:07:37] <sirenbot>	 6452 (ACKED)  ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[23:07:39] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[23:08:17] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:08:18] <swfrench-wmf>	 ryankemper: thanks for responding during the previous instance of this. does the service need another restart, or is there some other mitigation needed?
[23:09:52] <swfrench-wmf>	 also yeah, as s.ukhe pointed out, this is decoupled from the `page: false` for various services defined in the catalog, which (IIUC) largely controls catalog-controlled monitoring, like probes
[23:10:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:11:09] <swfrench-wmf>	 given the state of wdqs2009, would it make sense to add it to the exclusion regex in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/cdn.yaml ?
[23:11:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2009:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:11:44] <ryankemper>	 swfrench-wmf: absolutely
[23:12:36] <swfrench-wmf>	 ryankemper: great, let me open a task for that
[23:16:26] <wikibugs>	 06SRE: Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523 (10Scott_French) 03NEW
[23:16:48] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: disable ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523)
[23:16:56] <ryankemper>	 swfrench-wmf: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1166016
[23:17:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:18:03] <swfrench-wmf>	 ah, awesome!
[23:18:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:19:52] <wikibugs>	 (03CR) 10Scott French: [C:03+1] wdqs: disable ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523) (owner: 10Ryan Kemper)
[23:20:20] <logmsgbot>	 jhathaway@cumin2002 reimage (PID 2100) is awaiting input
[23:21:24] <wikibugs>	 06SRE, 06Data-Platform-SRE, 13Patch-For-Review: Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523#10970397 (10Scott_French)
[23:25:27] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm
[23:27:23] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: disable ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523) (owner: 10Ryan Kemper)
[23:28:32] <jinxer-wm>	 FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[23:28:38] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: disable ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523) (owner: 10Ryan Kemper)
[23:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:17] <tzatziki>	 !log removing 15 files for legal compliance
[23:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:23] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1166023
[23:38:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1166023 (owner: 10TrainBranchBot)
[23:40:51] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm
[23:40:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970426 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed...
[23:49:46] <wikibugs>	 (03PS1) 10Dzahn: initial commit - add .gitreview file [container/codesearch] - 10https://gerrit.wikimedia.org/r/1166037 (https://phabricator.wikimedia.org/T268199)
[23:50:04] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] initial commit - add .gitreview file [container/codesearch] - 10https://gerrit.wikimedia.org/r/1166037 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[23:50:27] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1166023 (owner: 10TrainBranchBot)
[23:53:01] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] initial commit - add .gitreview file [container/codesearch] - 10https://gerrit.wikimedia.org/r/1166037 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[23:53:23] <wikibugs>	 (03PS1) 10Dzahn: add initial blubber .pipeline config and a README [container/codesearch] - 10https://gerrit.wikimedia.org/r/1166044 (https://phabricator.wikimedia.org/T268199)
[23:59:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed