[00:06:08] ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965826 (Jclark-ctr) @VRiley-WMF Was a Dell ticket opened for this server? We have two other servers experiencing the same issue, and it has now reoccurred. T383051 T397851 T397829 @Eevans
[00:08:18] ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10965830 (Jclark-ctr) @Clement_Goubert it has cleared for the time i am still working with dell since this seems to be reoccurring issues i...
[00:08:31] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1165630
[00:08:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1165630 (owner: TrainBranchBot)
[00:09:02] ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10965831 (Jclark-ctr) @Clement_Goubert it has cleared for the time i am still working with dell since this seems to be reoccurring issues if...
[00:31:44] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1165630 (owner: TrainBranchBot)
[00:32:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.122) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[00:32:45] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:37:45] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:46:28] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[00:52:45] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:53:44] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm
[00:57:45] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[02:05:35] ops-eqiad, SRE, DC-Ops: sessionstore1005 doesn't boot - https://phabricator.wikimedia.org/T398225#10965935 (Jclark-ctr) @wiki_willy tagging you also for visibility. @Jhancock.wm @VRiley-WMF we should be opening tickets for this error with dell for these errors here is a quick list of servers...
[02:07:36] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965949 (Jclark-ctr) @MatthewVernon these are failing puppet do you need to set site.pp for insetup?
[02:07:38] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965950 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1092.eqiad.wmnet with OS bullseye executed...
[02:07:41] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10965951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1093.eqiad.wmnet with OS bullseye executed...
[02:12:01] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[02:23:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:28:32] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[02:32:44] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[02:50:02] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[03:11:22] (PS1) EggRoll97: Allow abusefilter block action on plwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1165635 (https://phabricator.wikimedia.org/T398137)
[04:31:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:41:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:56:24] (CR) Slyngshede: [V:+2 C:+2] Lock dulwich dependency at 0.22.1 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1165025 (https://phabricator.wikimedia.org/T397300) (owner: Slyngshede)
[05:57:23] !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[05:57:26] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300
[05:58:34] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[06:00:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0600)
[06:02:57] !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[06:03:00] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300
[06:04:19] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300
[06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:44] (PS1) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833)
[06:12:16] (CR) Arnaudb: "followed wikitech instructions to prep plugin installation via CI, let me know if anything else is required. I needed to edit the repo's ." [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[06:15:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool pc4 T378715', diff saved to https://phabricator.wikimedia.org/P78735 and previous config saved to /var/cache/conftool/dbconfig/20250702-061517-ladsgroup.json
[06:15:20] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715
[06:23:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Switch to 10G (T378715)
[06:28:25] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715
[06:29:30] !log dropping l10n_cache table everywhere (T397367)
[06:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:32] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[06:31:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:32:39] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:33:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:35:33] (PS1) Ayounsi: eqiad/codfw: remove more former TE [homer/public] - https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844)
[06:42:35] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[06:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:43:25] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.008 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[06:48:03] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor
[06:49:53] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[06:51:33] (PS1) Muehlenhoff: Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767
[06:52:15] (CR) CI reject: [V:-1] Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767 (owner: Muehlenhoff)
[06:52:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:53:58] (PS2) Muehlenhoff: Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767
[06:54:39] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:58:31] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29772 bytes in 0.686 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:00:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430)
[07:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0700).
[07:00:05] EggRoll97: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] o/
[07:11:12] (CR) Muehlenhoff: [C:+2] Remove access for vpoundstone [puppet] - https://gerrit.wikimedia.org/r/1165767 (owner: Muehlenhoff)
[07:12:21] (CR) Filippo Giunchedi: [C:+1] centrallog: Log with standard and custom template [puppet] - https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[07:18:22] Anyone available for deployment? (I'm not sure if I need to ask, but I haven't seen anyone yet)
[07:20:44] (CR) Hashar: [C:-1] "None of our other plugins (ex: go-import, lfs, zuul) are mentioned in `.gitignore`. If I try it I get:" [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:27:03] I'm in a conference, can't deploy stuff today :(
[07:27:33] Darn, thanks for letting me know though. urbanecm: are you available?
[07:29:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:37:32] (PS1) Muehlenhoff: Remove puppetserver1003/puppetserver2004 for maintenance [dns] - https://gerrit.wikimedia.org/r/1165815
[07:38:20] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5007.eqsin.wmnet with reason: reimage
[07:40:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5007.eqsin.wmnet with OS bookworm
[07:40:40] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966248 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5007.eqsin.wmnet with OS bookworm
[07:41:00] jouncebot now
[07:41:00] For the next 12 hour(s) and 48 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250701T1430)
[07:41:00] For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0700)
[07:41:05] jouncebot refresh
[07:41:06] I refreshed my knowledge about deployments.
[07:41:11] jouncebot now
[07:41:11] For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0700)
[07:41:19] Better :)
[07:48:32] (CR) MVernon: [C:+1] bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: Jcrespo)
[07:49:32] (CR) Muehlenhoff: [C:+2] Remove puppetserver1003/puppetserver2004 for maintenance [dns] - https://gerrit.wikimedia.org/r/1165815 (owner: Muehlenhoff)
[07:49:39] !log jmm@dns1004 START - running authdns-update
[07:50:41] !log jmm@dns1004 END - running authdns-update
[07:52:56] (CR) Jcrespo: [C:+2] bacula: Remove oldmain and olddirector roles, prepare for decom backup[12]01 [puppet] - https://gerrit.wikimedia.org/r/1164981 (https://phabricator.wikimedia.org/T398185) (owner: Jcrespo)
[07:53:36] (CR) Filippo Giunchedi: [C:+2] thanos: start sampled traces from query-frontend [puppet] - https://gerrit.wikimedia.org/r/1165493 (https://phabricator.wikimedia.org/T394414) (owner: Filippo Giunchedi)
[07:53:50] (CR) Filippo Giunchedi: [C:+2] thanos: notify services on tracing changes [puppet] - https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: Filippo Giunchedi)
[07:54:05] (PS2) Filippo Giunchedi: thanos: notify services on tracing changes [puppet] - https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414)
[07:54:25] (CR) Filippo Giunchedi: [V:+2 C:+2] thanos: notify services on tracing changes [puppet] - https://gerrit.wikimedia.org/r/1165494 (https://phabricator.wikimedia.org/T394414) (owner: Filippo Giunchedi)
[07:54:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:14] (Abandoned) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:00:05] jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0800).
[08:00:34] (Restored) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:01:25] morning, the train will roll out shortly
[08:02:56] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver1003.eqiad.wmnet
[08:03:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage
[08:04:42] (PS2) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833)
[08:06:52] (PS1) TrainBranchBot: group1 to 1.45.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165819 (https://phabricator.wikimedia.org/T392178)
[08:06:54] (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165819 (https://phabricator.wikimedia.org/T392178) (owner: TrainBranchBot)
[08:07:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage
[08:07:46] (Merged) jenkins-bot: group1 to 1.45.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1165819 (https://phabricator.wikimedia.org/T392178) (owner: TrainBranchBot)
[08:10:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1003.eqiad.wmnet
[08:10:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:13:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:14] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.8 refs T392178
[08:16:17] T392178: 1.45.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T392178
[08:18:52] (PS2) Arnaudb: gerrit: add readonly plugin [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1165528 (https://phabricator.wikimedia.org/T387833)
[08:20:52] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetserver2004.codfw.wmnet
[08:24:17] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202842s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:27:08] (CR) Ayounsi: "overall lgtm, not easy to do a thorough review." [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:28:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2004.codfw.wmnet
[08:30:05] (PS1) Muehlenhoff: Revert "Remove puppetserver1003/puppetserver2004 for maintenance" [dns] - https://gerrit.wikimedia.org/r/1165821
[08:32:05] (CR) Filippo Giunchedi: "Maybe I'm missing something, though I'd expect cleanup to happen on service start so pyrra-filesystem starts with a blank slate." [puppet] - https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:32:54] (CR) Muehlenhoff: [C:+2] Revert "Remove puppetserver1003/puppetserver2004 for maintenance" [dns] - https://gerrit.wikimedia.org/r/1165821 (owner: Muehlenhoff)
[08:33:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5007.eqsin.wmnet with OS bookworm
[08:33:09] !log jmm@dns1004 START - running authdns-update
[08:33:15] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966375 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5007.eqsin.wmnet with OS bookworm completed: - ganeti5007 (**PASS*...
[08:33:27] (CR) Filippo Giunchedi: pyrra-filesystem: clear output file on service stop (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:34:12] !log jmm@dns1004 END - running authdns-update
[08:34:57] (CR) Cathal Mooney: "thanks for the review, few replies in line I will submit another patch later with those few updates." [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:43:11] ops-codfw, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10966425 (cmooney) >>! In T396396#10955048, @Andrew wrote: >>>! In T396396#10954940, @cmooney wrote: >> Folks you need to delete th...
[08:43:23] (PS1) Jcrespo: bacula: Remove backup1001 old backup director host from puppet [puppet] - https://gerrit.wikimedia.org/r/1165822 (https://phabricator.wikimedia.org/T387892)
[08:43:26] (PS1) Jcrespo: bacula: Remove backup2001, old offsite backup host [puppet] - https://gerrit.wikimedia.org/r/1165823 (https://phabricator.wikimedia.org/T387892)
[08:43:38] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[08:43:38] (PS1) Jelto: miscweb: bump another three miscweb images to bookworm [deployment-charts] - https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303)
[08:46:45] (CR) JMeybohm: [C:+1] pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[08:47:33] (CR) JMeybohm: [C:+2] sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:47:35] (CR) JMeybohm: [C:+2] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:47:38] (CR) JMeybohm: [C:+2] k8s.pool-depool-cluster: Black format [cookbooks] - https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:47:42] (CR) Volans: [C:+2] Upstream release v0.6.4 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165006 (owner: Volans)
[08:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[08:53:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[08:53:21] (Merged) jenkins-bot: k8s.pool-depool-cluster: Black format [cookbooks] - https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:54:39] (Merged) jenkins-bot: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[08:55:19] (PS2) Volans: debmonitor: add link to docker-registry [puppet] - https://gerrit.wikimedia.org/r/1164999 (https://phabricator.wikimedia.org/T397696)
[08:55:20] (PS3) Volans: debmonitor: use the new endpoint for the check [puppet] - https://gerrit.wikimedia.org/r/1164485 (https://phabricator.wikimedia.org/T397696)
[08:55:20] (PS1) Volans: debmonitor: fix debmonitor_servers hieradata [puppet] - https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696)
[08:56:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5007.eqsin.wmnet to cluster eqsin and group 1
[08:57:14] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[08:58:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5007.eqsin.wmnet to cluster eqsin and group 1
[08:59:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:45] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966502 (MoritzMuehlenhoff)
[08:59:53] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10966505 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done!
[09:00:05] (Merged) jenkins-bot: sre.k8s.wipe-cluster: Downtime services [cookbooks] - https://gerrit.wikimedia.org/r/1165026 (https://phabricator.wikimedia.org/T397148) (owner: JMeybohm)
[09:00:06] (Merged) jenkins-bot: Upstream release v0.6.4 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1165006 (owner: Volans)
[09:00:35] PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:01:01] ^^ expected?
[09:01:25] RECOVERY - puppetboard.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:01:34] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#10966513 (MoritzMuehlenhoff)
[09:01:51] !log rebalance ganeti/eqsin following Bookworm reimages
[09:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3006.esams.wmnet to cluster esams02 and group BW27
[09:04:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3006.esams.wmnet to cluster esams02 and group BW27
[09:04:48] SRE, Infrastructure-Foundations, netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412 (cmooney) NEW p:Triage→Low
[09:04:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet
[09:05:14] (CR) Ayounsi: Switch BGP: Automate & unify IBGP configs on switches (4 comments) [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[09:05:24] !log jmm@dns1004 START - running authdns-update
[09:05:25] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#10966569 (ops-monitoring-bot) Draining ganeti3006.esams.wmnet of running VMs
[09:06:23] !log jmm@dns1004 END - running authdns-update
[09:06:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet
[09:06:58] !log uploaded debmonitor-server,python3-debmonitor_0.6.4 to apt.wikimedia.org bookworm-wikimedia
[09:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:07] there is some error on bacula config, I am debugging now
[09:07:27] PROBLEM - bacula director process on backup1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:07:36] ^ this is the error
[09:08:21] jouncebot: nowandnext
[09:08:21] For the next 0 hour(s) and 51 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0800)
[09:08:21] In 0 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1000)
[09:08:32] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:09:43] (CR) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[09:10:18] jnuche: do you currently need the window for the train or may I do a backport?
[09:10:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:11:09] zabe: train is stable right now, please go ahead :) [09:11:18] thanks:) [09:11:34] (03PS1) 10Zabe: Fix categorylinks join order and use index on correct table [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165827 (https://phabricator.wikimedia.org/T398380) [09:11:42] FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:11:44] (03CR) 10Zabe: [C:03+2] Fix categorylinks join order and use index on correct table [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165827 (https://phabricator.wikimedia.org/T398380) (owner: 10Zabe) [09:12:13] found the issue with bacula director, a leftover from an old host [09:12:17] sending patch [09:13:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10966613 (10MoritzMuehlenhoff) [09:14:28] (03Merged) 10jenkins-bot: Fix categorylinks join order and use index on correct table [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165827 (https://phabricator.wikimedia.org/T398380) (owner: 10Zabe) [09:15:26] (03PS1) 10Jcrespo: bacula: Remove reference to old backup pool, now removed [puppet] - 10https://gerrit.wikimedia.org/r/1165828 (https://phabricator.wikimedia.org/T398188) [09:15:27] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165827|Fix categorylinks join order and use index on correct table (T398380)]] [09:15:31] T398380: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown 
column 'cl_target_id' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len, - https://phabricator.wikimedia.org/T398380 [09:15:38] (03PS2) 10Jcrespo: bacula: Remove reference to old backup pool, now removed [puppet] - 10https://gerrit.wikimedia.org/r/1165828 (https://phabricator.wikimedia.org/T398188) [09:16:52] (03CR) 10Jcrespo: [C:03+2] bacula: Remove reference to old backup pool, now removed [puppet] - 10https://gerrit.wikimedia.org/r/1165828 (https://phabricator.wikimedia.org/T398188) (owner: 10Jcrespo) [09:17:31] (03PS1) 10Vgutierrez: hiera: Switch lvs4010 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) [09:17:39] !log zabe@deploy1003 zabe: Backport for [[gerrit:1165827|Fix categorylinks join order and use index on correct table (T398380)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:18:23] !log zabe@deploy1003 zabe: Continuing with sync [09:18:52] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:19:17] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1165563 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:19:48] (03PS1) 10Zabe: Reapply "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165831 (https://phabricator.wikimedia.org/T397912) [09:20:02] bacula should be healthy now [09:20:15] (03CR) 10Vgutierrez: [C:03+2] hiera: Consolidate katran config for magru [puppet] - 10https://gerrit.wikimedia.org/r/1165563 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:20:27] RECOVERY - bacula director process on backup1014 is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:23:53] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165827|Fix categorylinks join order and use index on correct table (T398380)]] (duration: 08m 26s) [09:23:56] T398380: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len, - https://phabricator.wikimedia.org/T398380 [09:24:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:25] (03CR) 10Zabe: [C:03+2] Reapply "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165831 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [09:24:28] (03PS1) 10Hashar: gerrit: 
config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) [09:24:37] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [09:24:54] (03CR) 10CI reject: [V:04-1] gerrit: config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [09:25:18] (03Merged) 10jenkins-bot: Reapply "categorylinks: Set group0 to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165831 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [09:25:45] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165831|Reapply "categorylinks: Set group0 to read new" (T397912)]] [09:25:48] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [09:26:42] RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:27:53] !log zabe@deploy1003 zabe: Backport for [[gerrit:1165831|Reapply "categorylinks: Set group0 to read new" (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:28:38] !log zabe@deploy1003 zabe: Continuing with sync [09:29:18] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1093.eqiad.wmnet with OS bullseye [09:29:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966662 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1093.eqiad.wmnet with OS bullseye [09:30:03] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1092.eqiad.wmnet with OS bullseye [09:30:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1092.eqiad.wmnet with OS bullseye [09:31:42] FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [09:36:01] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165831|Reapply "categorylinks: Set group0 to read new" (T397912)]] (duration: 10m 15s) [09:36:03] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [09:36:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10966680 (10ops-monitoring-bot) Draining ganeti6004.drmrs.wmnet of running VMs [09:36:42] RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:36:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [09:37:44] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts backup2001.codfw.wmnet [09:38:42] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10966696 (10ayounsi) [09:39:39] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10966702 (10ayounsi) option 2 lgtm! [09:39:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to plain [09:40:28] (03PS2) 10Jcrespo: bacula: Remove backup2001, old offsite backup host [puppet] - 10https://gerrit.wikimedia.org/r/1165823 (https://phabricator.wikimedia.org/T387892) [09:40:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to plain [09:40:53] (03PS5) 10Vgutierrez: acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) [09:40:59] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [09:42:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to plain [09:42:54] (03PS1) 10Giuseppe Lavagetto: Hotfixes release: [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1165835 [09:42:59] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [09:43:06] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Hotfixes release: [software/hiddenparma/deploy] - 
10https://gerrit.wikimedia.org/r/1165835 (owner: 10Giuseppe Lavagetto) [09:43:46] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes: api auth and bwlimit rules - oblivian@cumin1003" [09:43:48] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes: api auth and bwlimit rules - oblivian@cumin1003 [09:43:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to plain [09:44:03] jouncebot: nowandnext [09:44:03] For the next 0 hour(s) and 15 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T0800) [09:44:03] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1000) [09:44:17] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes: api auth and bwlimit rules - oblivian@cumin1003 [09:44:18] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes: api auth and bwlimit rules - oblivian@cumin1003" [09:44:23] jnuche: can I sync a patch to wmf.8? 
[09:45:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to plain [09:45:40] (03PS1) 10Kosta Harlan: UserInfoCard: prevent default link behavior with "click" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165836 (https://phabricator.wikimedia.org/T398323) [09:46:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to plain [09:46:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to plain [09:47:13] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [09:47:16] (03CR) 10Vgutierrez: [C:03+2] acme_chief: Don't issue RSA certs by default [puppet] - 10https://gerrit.wikimedia.org/r/1164418 (https://phabricator.wikimedia.org/T398020) (owner: 10Vgutierrez) [09:47:37] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [09:47:37] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:47:38] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup2001.codfw.wmnet [09:47:43] (03CR) 10Ayounsi: [C:03+1] "2nd pass lgtm!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [09:47:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to plain [09:48:34] I assume it's OK, so I am proceeding [09:49:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165836 (https://phabricator.wikimedia.org/T398323) (owner: 10Kosta Harlan) [09:49:23] !log acme-chief: stop issuing RSA certificates by default - T398020 [09:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:25] T398020: Stop issuing RSA certificates - https://phabricator.wikimedia.org/T398020 [09:49:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to plain [09:50:01] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:50:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to plain [09:50:51] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage [09:50:59] PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:51:12] (03CR) 10Jcrespo: [C:03+2] bacula: Remove backup2001, old offsite backup host [puppet] - 10https://gerrit.wikimedia.org/r/1165823 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [09:51:27] (03CR) 10Cathal Mooney: [C:03+1] eqiad/codfw: remove more former TE [homer/public] - 10https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi) [09:51:47] 
(03CR) 10Ayounsi: [C:03+2] eqiad/codfw: remove more former TE [homer/public] - 10https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi) [09:51:59] RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:52:18] (03Merged) 10jenkins-bot: eqiad/codfw: remove more former TE [homer/public] - 10https://gerrit.wikimedia.org/r/1165732 (https://phabricator.wikimedia.org/T377844) (owner: 10Ayounsi) [09:53:08] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6004.drmrs.wmnet with reason: reimage [09:53:21] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398188#10966747 (10jcrespo) [09:53:43] (03PS3) 10Hashar: gerrit: config replicas for rename-project plugin [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) [09:53:44] (03CR) 10Hashar: "There are a few MediaWiki libraries I'd like to move `mediawiki/libs` (T125031). That would simplify the CI configuration in the new Zuul." [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [09:53:45] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission backup2001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398188#10966752 (10jcrespo) This is ready. Reminder it has 2 disks arrays attached. 
[09:54:05] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:54:12] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1093.eqiad.wmnet with reason: host reimage [09:54:30] (03PS1) 10Hnowlan: mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397970) [09:54:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6004.drmrs.wmnet with OS bookworm [09:54:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10966769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6004.drmrs.wmnet with OS bookworm [09:55:02] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts backup1001.eqiad.wmnet [09:55:14] (03PS2) 10Volans: debmonitor: fix debmonitor_servers hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696) [09:56:59] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage [09:58:32] (03Merged) 10jenkins-bot: UserInfoCard: prevent default link behavior with "click" [extensions/CheckUser] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165836 (https://phabricator.wikimedia.org/T398323) (owner: 10Kosta Harlan) [09:58:32] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:43] (03CR) 10Arnaudb: [C:03+1] "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [09:58:57] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1165836|UserInfoCard: prevent default link behavior with "click" (T398323)]] [09:59:00] T398323: UserInfoCard: Browser jumps to the top of the page when opening card - https://phabricator.wikimedia.org/T398323 [09:59:54] (03PS1) 10Volans: postinst: clear stale files [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1000) [10:00:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:31] (03CR) 10Volans: [C:03+2] debmonitor: fix debmonitor_servers hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1165825 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:00:58] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [10:01:20] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1165836|UserInfoCard: prevent default link behavior with "click" (T398323)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:01:23] (03PS1) 10Muehlenhoff: Update account settings for aude [puppet] - 10https://gerrit.wikimedia.org/r/1165840 [10:02:14] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1092.eqiad.wmnet with reason: host reimage [10:03:07] !log kharlan@deploy1003 kharlan: Continuing with sync [10:04:36] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [10:04:53] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [10:04:53] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:04:54] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup1001.eqiad.wmnet [10:07:29] (03PS2) 10Jcrespo: bacula: Remove backup1001 old backup director host from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1165822 (https://phabricator.wikimedia.org/T387892) [10:08:12] (03CR) 10Jcrespo: [C:03+2] bacula: Remove backup1001 old backup director host from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1165822 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:08:50] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165836|UserInfoCard: prevent default link behavior with "click" (T398323)]] (duration: 09m 52s) [10:08:52] T398323: UserInfoCard: Browser jumps to the top of the page when opening card - https://phabricator.wikimedia.org/T398323 [10:09:00] (03PS1) 10Majavah: P:toolforge::static: Put HAProxy in front of the Nginx instance [puppet] - 10https://gerrit.wikimedia.org/r/1165841 (https://phabricator.wikimedia.org/T397634) [10:09:11] done deploying [10:11:11] 
10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission backup1001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398185#10966840 (10jcrespo) [10:11:26] (03PS2) 10Hnowlan: mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) [10:13:12] !log mvernon@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003" [10:13:46] (03PS1) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) [10:14:08] (03CR) 10Muehlenhoff: [C:03+2] Update account settings for aude [puppet] - 10https://gerrit.wikimedia.org/r/1165840 (owner: 10Muehlenhoff) [10:14:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6004.drmrs.wmnet with reason: host reimage [10:14:31] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [10:15:53] jnuche: should we deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1165160 to fix the logspam for now? 
[10:16:16] mvernon@cumin1003 reimage (PID 4138533) is awaiting input [10:17:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6004.drmrs.wmnet with reason: host reimage [10:17:41] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [10:18:46] !log mvernon@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003" [10:18:46] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1093.eqiad.wmnet with OS bullseye [10:18:49] (03PS2) 10Majavah: P:toolforge::static: Put HAProxy in front of the Nginx instance [puppet] - 10https://gerrit.wikimedia.org/r/1165841 (https://phabricator.wikimedia.org/T397634) [10:18:49] (03PS1) 10Majavah: P:toolforge::static: Handle simple redirects in HAProxy config [puppet] - 10https://gerrit.wikimedia.org/r/1165843 (https://phabricator.wikimedia.org/T397634) [10:18:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1093.eqiad.wmnet with OS bullseye complete... 
[10:19:31] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [10:19:39] kostajh: a fix for that would be awesome, it's the largest single type of error in the logs right now [10:20:38] (03PS2) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) [10:21:00] (03PS1) 10Zabe: maintain-views: Use linktarget and collation in categorylinks view [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) [10:21:05] !log mvernon@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003" [10:21:13] (03Abandoned) 10Jforrester: FunctionEvaluator.vue: prod bug - js error for functions with Typed list as input param [extensions/WikiLambda] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163390 (https://phabricator.wikimedia.org/T397682) (owner: 10Jforrester) [10:21:22] !log mvernon@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1003" [10:21:23] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1092.eqiad.wmnet with OS bullseye [10:21:32] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1092.eqiad.wmnet with OS bullseye complete... 
[10:21:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [10:22:25] (03CR) 10Jelto: [C:03+2] miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [10:22:26] (03PS1) 10Klausman: ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) [10:23:00] (03CR) 10AikoChou: [C:03+1] ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) (owner: 10Klausman) [10:24:04] (03CR) 10Klausman: [C:03+2] ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) (owner: 10Klausman) [10:24:20] (03CR) 10Clément Goubert: [C:03+1] api-gateway: use more recent ratelimit image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [10:24:23] (03Merged) 10jenkins-bot: miscweb: bump another three miscweb images to bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165824 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [10:25:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:42] (03PS2) 10Zabe: maintain-views: Use linktarget and collation in categorylinks view [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) [10:26:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 
06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966891 (10MatthewVernon) @Jclark-ctr the problem with these two nodes was the same as we've had with every one of this batch of Dell servers - they arri... [10:26:11] (03Merged) 10jenkins-bot: ml-services/experimental: remove leftover editcheck services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165845 (https://phabricator.wikimedia.org/T397013) (owner: 10Klausman) [10:26:21] (03CR) 10Hnowlan: [C:03+2] "thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [10:26:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10966893 (10MatthewVernon) [10:26:44] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [10:27:06] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:27:28] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [10:27:47] (03PS1) 10Zabe: group1: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165846 (https://phabricator.wikimedia.org/T397912) [10:27:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [10:28:00] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [10:28:10] (03Merged) 10jenkins-bot: api-gateway: use more recent ratelimit image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165475 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [10:28:19] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [10:28:57] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [10:29:21] (03PS1) 10Volans: 
kubernetes: fine-tune displayed name [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) [10:29:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:30:31] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [10:32:09] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:33:19] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:33:56] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2248.codfw.wmnet [10:35:19] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:35:31] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:36:31] jnuche: we may make a different patch, later today [10:37:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10966992 (10Clement_Goubert) We don't particularly need the node in production as we have spare capacity, if you need them depooled for testing w... [10:37:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10966999 (10Clement_Goubert) We don't particularly need the node in production as we have spare capacity, if you need them depooled for testing we... 
[10:38:06] (03PS1) 10Muehlenhoff: Switch mc-gp2004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1165849 [10:38:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165849 (owner: 10Muehlenhoff) [10:38:35] (03PS1) 10Elukey: admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 [10:39:07] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2248.codfw.wmnet [10:39:10] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2249.codfw.wmnet [10:39:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6004.drmrs.wmnet with OS bookworm [10:39:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967021 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6004.drmrs.wmnet with OS bookworm completed: - ganeti6004 (**PASS*... 
[10:40:00] (03CR) 10Klausman: [C:03+1] admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 (owner: 10Elukey) [10:40:19] (03CR) 10Elukey: [C:03+1] kubernetes: fine-tune displayed name [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:40:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:46] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [10:41:02] jnuche: ack, ty [10:42:05] (03PS2) 10Elukey: admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 [10:42:32] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:42:54] (03CR) 10Klausman: [C:03+1] admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 (owner: 10Elukey) [10:43:37] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [10:43:40] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:44:02] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[10:44:31] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2249.codfw.wmnet [10:44:34] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2250.codfw.wmnet [10:45:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:07] (03PS1) 10MVernon: hiera: add ms-be109[2-5] to swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1165851 (https://phabricator.wikimedia.org/T391354) [10:46:10] (03PS1) 10MVernon: swift/eqiad: add ms-be109[2,3], drain ms-be1063 [puppet] - 10https://gerrit.wikimedia.org/r/1165852 (https://phabricator.wikimedia.org/T391354) [10:46:18] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 137236 [10:47:01] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:47:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 137236 [10:47:17] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:47:46] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [10:47:57] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[10:48:07] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 37271 [10:48:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37271 [10:48:55] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:14] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [10:49:59] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2250.codfw.wmnet [10:49:59] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [10:50:02] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2251.codfw.wmnet [10:51:13] (03CR) 10Jcrespo: [C:03+1] hiera: add ms-be109[2-5] to swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1165851 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [10:51:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [10:52:04] (03CR) 10MVernon: [C:03+2] hiera: add ms-be109[2-5] to swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1165851 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [10:52:21] (03CR) 10Jcrespo: [C:03+1] swift/eqiad: add ms-be109[2,3], drain ms-be1063 [puppet] - 10https://gerrit.wikimedia.org/r/1165852 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [10:52:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:53:55] RESOLVED: SystemdUnitFailed: user@499.service on 
aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:19] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2251.codfw.wmnet [10:55:22] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2252.codfw.wmnet [10:59:17] (03CR) 10MVernon: [C:03+2] swift/eqiad: add ms-be109[2,3], drain ms-be1063 [puppet] - 10https://gerrit.wikimedia.org/r/1165852 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [10:59:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1100). nyaa~ [11:00:28] (03PS1) 10Tiziano Fogli: pontoon: add fw rules to allow titan to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/1165855 [11:00:50] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2252.codfw.wmnet [11:00:54] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2253.codfw.wmnet [11:03:06] (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [11:04:25] FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:35] 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - 
https://phabricator.wikimedia.org/T391354#10967120 (10MatthewVernon) [11:04:39] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2253.codfw.wmnet [11:04:42] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2254.codfw.wmnet [11:06:55] FIRING: [2x] SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2254.codfw.wmnet [11:10:01] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2255.codfw.wmnet [11:12:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6004.drmrs.wmnet to cluster drmrs02 and group B13 [11:14:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6004.drmrs.wmnet to cluster drmrs02 and group B13 [11:15:39] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2255.codfw.wmnet [11:15:42] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2256.codfw.wmnet [11:16:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to drbd [11:16:32] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of netflow6001.drmrs.wmnet to drbd [11:16:41] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [11:17:33] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes 
in 1.169 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:18:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [11:19:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to drbd [11:19:44] (03CR) 10Sergio Gimeno: [C:03+1] "Not sure what's the appropriate way to merge this, backport?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm) [11:20:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2256.codfw.wmnet [11:21:02] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2257.codfw.wmnet [11:23:42] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:26:30] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2257.codfw.wmnet [11:26:33] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2258.codfw.wmnet [11:28:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to drbd [11:28:45] PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:25] RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:39] RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 87.52 ms [11:30:33] (03CR) 10Effie Mouzeli: [C:03+2] Switch mc-gp2004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1165849 (owner: 10Muehlenhoff) [11:31:49] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2258.codfw.wmnet [11:31:53] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2259.codfw.wmnet [11:31:55] RESOLVED: SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:32:53] FIRING: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [11:33:28] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet [11:33:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:33:42] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 37271 [11:35:22] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 37271 [11:37:05] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2259.codfw.wmnet [11:37:08] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2260.codfw.wmnet [11:37:53] RESOLVED: FNMNotReported: FastNetMon metrics not reported - 
https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [11:38:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to drbd [11:40:03] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet [11:40:51] (03CR) 10Filippo Giunchedi: "LGTM, though please move the setting to modules/pontoon/files/settings/titan.yaml since alerting_host can function without titan and the v" [puppet] - 10https://gerrit.wikimedia.org/r/1165855 (owner: 10Tiziano Fogli) [11:41:33] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165486 (owner: 10Muehlenhoff) [11:42:20] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2260.codfw.wmnet [11:42:24] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2261.codfw.wmnet [11:42:32] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [11:43:40] (03CR) 10Jgiannelos: mobileapps: allow setting terminationGracePeriodSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:45:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10967282 (10Jclark-ctr) @MatthewVernon Thanks for assistance [11:45:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10967283 (10Jclark-ctr) 05Open→03Resolved [11:46:34] (03CR) 10Hnowlan: mobileapps: allow setting terminationGracePeriodSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 
(https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:46:39] PROBLEM - Memcached on mc2050 is CRITICAL: connect to address 10.192.32.82 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:47:31] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:47:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2261.codfw.wmnet [11:47:40] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2262.codfw.wmnet [11:48:25] FIRING: SystemdUnitFailed: memcached.service on mc2050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:57] (03CR) 10Jgiannelos: [C:03+1] mobileapps: allow setting terminationGracePeriodSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:50:29] (03CR) 10Ladsgroup: [C:03+1] "LGTM but I'm in a conference and will have a hard time running the maintain views maybe Francesco can?" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [11:51:23] (03CR) 10FNegri: "Sure I'll do it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [11:51:32] (03CR) 10Clément Goubert: [C:03+1] mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:52:39] RECOVERY - Memcached on mc2050 is OK: TCP OK - 0.030 second response time on 10.192.32.82 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [11:52:56] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2262.codfw.wmnet [11:53:00] (03CR) 10Hnowlan: [C:03+2] mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:53:00] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2263.codfw.wmnet [11:53:25] RESOLVED: SystemdUnitFailed: memcached.service on mc2050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:25] FIRING: SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:40] (03Merged) 10jenkins-bot: mobileapps: allow setting terminationGracePeriodSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165838 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:55:27] FIRING: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:56:18] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [11:58:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2263.codfw.wmnet [11:58:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2264.codfw.wmnet [12:00:36] PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:01:26] RECOVERY - puppetboard.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:01:40] PROBLEM - Host prometheus6002 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:39] (03CR) 10Ladsgroup: [C:03+1] "Thank you! <3" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [12:04:08] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2264.codfw.wmnet [12:04:11] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2265.codfw.wmnet [12:04:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#10967364 (10Jclark-ctr) if you can deploy them that would be great so there is some load @Clement_Goubert [12:06:36] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:07:46] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:08:23] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:08:33] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:09:25] FIRING: [2x] SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:35] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2265.codfw.wmnet [12:09:38] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2266.codfw.wmnet [12:10:31] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:10:58] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:14:39] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:14:51] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2266.codfw.wmnet [12:14:54] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2267.codfw.wmnet [12:19:25] FIRING: [2x] SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:10] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2267.codfw.wmnet [12:20:14] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2268.codfw.wmnet [12:24:57] (03PS2) 10Tiziano Fogli: pontoon: add fw rules to allow titan to send alerts to am [puppet] - 
10https://gerrit.wikimedia.org/r/1165855 [12:25:30] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2268.codfw.wmnet [12:25:33] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2269.codfw.wmnet [12:25:49] (03CR) 10Tiziano Fogli: [C:03+2] pontoon: add fw rules to allow titan to send alerts to am [puppet] - 10https://gerrit.wikimedia.org/r/1165855 (owner: 10Tiziano Fogli) [12:27:21] mmhh prometheus6002 is unhappy [12:27:23] checking [12:28:21] ah mmhh admin_down, moritzm known? [12:28:27] prometheus6002.drmrs.wmnet kvm debootstrap+default ganeti6002.drmrs.wmnet ADMIN_down - [12:29:02] the host is being switched to DRBD as part of the bookworm update of ganeti/drmrs [12:29:25] RESOLVED: [2x] SystemdUnitFailed: user@499.service on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:25] drmrs has the unfortunate 2x2 design, so this is inevitable [12:29:26] (03CR) 10FNegri: [C:03+1] "@zabe@avorwerk.net I'll let you merge first, then I'll run maintain-views after this is merged." [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [12:29:35] should be done in 10 mins approx [12:29:53] ack thank you, I missed the drain-node invocation from earlier [12:29:56] all good [12:30:50] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2269.codfw.wmnet [12:30:53] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2270.codfw.wmnet [12:32:02] godog: drmrs is up for refresh in the next 12 months, then we'll create the new cluster with routed Ganeti, which doesn't have this issue [12:32:14] moritzm: neat! 
looking forward to that [12:36:19] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2270.codfw.wmnet [12:36:23] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2271.codfw.wmnet [12:41:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to drbd [12:41:35] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2271.codfw.wmnet [12:41:39] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2272.codfw.wmnet [12:42:32] RECOVERY - Host prometheus6002 is UP: PING OK - Packet loss = 0%, RTA = 87.43 ms [12:44:28] (03PS1) 10Urbanecm: [Growth] Move Impact limit configuration to ext-GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) [12:44:30] (03PS1) 10Urbanecm: [Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) [12:45:27] RESOLVED: [3x] SLOMetricAbsent: wdqs-main-availability drmrs - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:45:41] (03CR) 10Michael Große: [C:03+1] "Right, I completely missed that. Thanks for catching it!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) (owner: 10Urbanecm) [12:45:55] (03CR) 10Michael Große: [C:03+1] [Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) (owner: 10Urbanecm) [12:46:37] jouncebot: nowandnext [12:46:38] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [12:46:38] In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1300) [12:47:01] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2272.codfw.wmnet [12:47:04] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2273.codfw.wmnet [12:47:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) (owner: 10Urbanecm) [12:47:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) (owner: 10Urbanecm) [12:48:26] (03Merged) 10jenkins-bot: [Growth] Move Impact limit configuration to ext-GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165865 (https://phabricator.wikimedia.org/T341599) (owner: 10Urbanecm) [12:48:33] (03Merged) 10jenkins-bot: [Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165866 (https://phabricator.wikimedia.org/T398418) (owner: 10Urbanecm) [12:48:57] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1165865|[Growth] Move Impact limit configuration to ext-GrowthExperiments (T341599)]], [[gerrit:1165866|[Growth] enwiki: Decrease 
wgGEUserImpactMaxEdits to 1000 (T398418 T341599)]] [12:49:01] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [12:49:02] T398418: TypeError: array_map(): Argument #2 ($array) must be of type array, int given - https://phabricator.wikimedia.org/T398418 [12:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [12:51:16] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1165865|[Growth] Move Impact limit configuration to ext-GrowthExperiments (T341599)]], [[gerrit:1165866|[Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 (T398418 T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:51:52] (03CR) 10Zabe: "I do not have +2 for puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [12:52:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2273.codfw.wmnet [12:52:21] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2274.codfw.wmnet [12:52:58] !log urbanecm@deploy1003 urbanecm: Continuing with sync [12:55:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup1001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398185#10967592 (10Jclark-ctr) a:03Jclark-ctr [12:55:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup1001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398185#10967594 (10Jclark-ctr) 05Open→03Resolved [12:55:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to drbd [12:56:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10967599 (10Jclark-ctr) @Stevemunene is this on hold by anything else in Eqiad? 
[12:57:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10967604 (10Jclark-ctr) a:03VRiley-WMF [12:57:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2274.codfw.wmnet [12:57:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2275.codfw.wmnet [12:58:40] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165865|[Growth] Move Impact limit configuration to ext-GrowthExperiments (T341599)]], [[gerrit:1165866|[Growth] enwiki: Decrease wgGEUserImpactMaxEdits to 1000 (T398418 T341599)]] (duration: 09m 42s) [12:58:43] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [12:58:44] T398418: TypeError: array_map(): Argument #2 ($array) must be of type array, int given - https://phabricator.wikimedia.org/T398418 [12:58:49] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10967608 (10Jclark-ctr) @MoritzMuehlenhoff is this still an issue could you verify again and we can try a different cable / brand cable if it is still slow? [12:59:00] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10967611 (10Jclark-ctr) a:03Jclark-ctr [13:00:04] Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1300). [13:00:04] EggRoll97: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10967614 (10Jclark-ctr) @RobH Can we close this task now that a decision has been made? [13:02:19] o/ [13:02:34] EggRoll97: o/ just reading the background on the patches a moment, given there's legal approval and the such :) [13:02:48] All good, sorry I'm a couple minutes late, I was having issues with IRC [13:02:53] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2275.codfw.wmnet [13:02:56] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2276.codfw.wmnet [13:03:42] FIRING: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97) [13:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164637 (https://phabricator.wikimedia.org/T398107) (owner: 10EggRoll97) [13:04:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10967629 (10Jclark-ctr) Dell just responded they are going to send a new backplane for this device it will probably not arrive till Thursday /Satur... 
[13:05:22] (03Merged) 10jenkins-bot: Assign oathauth-verify-user to default bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162158 (https://phabricator.wikimedia.org/T265726) (owner: 10EggRoll97) [13:05:24] (03Merged) 10jenkins-bot: Add abusefilter-revert to sysops on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164637 (https://phabricator.wikimedia.org/T398107) (owner: 10EggRoll97) [13:05:48] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1162158|Assign oathauth-verify-user to default bureaucrat (T265726)]], [[gerrit:1164637|Add abusefilter-revert to sysops on testwiki (T398107)]] [13:05:52] T265726: Assign oathauth-verify-user to bureaucrats on WMF wikis - https://phabricator.wikimedia.org/T265726 [13:05:53] T398107: Enable abusefilter-revert on testwiki - https://phabricator.wikimedia.org/T398107 [13:06:55] (03PS1) 10Muehlenhoff: Remove now unused Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1165874 [13:07:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to drbd [13:07:44] PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:48] RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 87.58 ms [13:08:05] !log samtar@deploy1003 samtar, eggroll97: Backport for [[gerrit:1162158|Assign oathauth-verify-user to default bureaucrat (T265726)]], [[gerrit:1164637|Add abusefilter-revert to sysops on testwiki (T398107)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:08:11] EggRoll97: those are both available to test on mwdebug [13:08:12] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2276.codfw.wmnet [13:08:16] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2277.codfw.wmnet [13:08:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to drbd [13:08:42] RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:10:02] (03PS1) 10Jgreen: Adjust codfw payments hostnames after deprecating LVS servers. [dns] - 10https://gerrit.wikimedia.org/r/1165877 (https://phabricator.wikimedia.org/T398321) [13:10:36] TheresNoTime: seems fine from what I can see [13:11:26] (03CR) 10Jgreen: [V:03+1 C:03+2] Adjust codfw payments hostnames after deprecating LVS servers. 
[dns] - 10https://gerrit.wikimedia.org/r/1165877 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen) [13:11:31] !log samtar@deploy1003 samtar, eggroll97: Continuing with sync [13:11:58] !log jgreen@dns1004 START - running authdns-update [13:12:06] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:13:06] !log jgreen@dns1004 END - running authdns-update [13:13:27] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2277.codfw.wmnet [13:13:31] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2278.codfw.wmnet [13:15:32] (03PS2) 10Andrew Bogott: cloudceph: move per-host puppet7 def to role [puppet] - 10https://gerrit.wikimedia.org/r/1165587 [13:15:32] (03PS1) 10Andrew Bogott: Include repo for ceph v16 'pacific' on cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1165878 (https://phabricator.wikimedia.org/T306820) [13:15:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott) [13:15:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165878 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [13:16:33] (03CR) 10Klausman: [C:03+1] amd-pytorch21: delete torch 2.1.2 + ROCm 5.6 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1164329 (owner: 10Ilias Sarantopoulos) [13:17:05] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1162158|Assign oathauth-verify-user to default bureaucrat (T265726)]], [[gerrit:1164637|Add abusefilter-revert to sysops on testwiki (T398107)]] (duration: 11m 16s) [13:17:13] T265726: Assign oathauth-verify-user to bureaucrats on WMF wikis - https://phabricator.wikimedia.org/T265726 [13:17:13] T398107: Enable abusefilter-revert on testwiki - 
https://phabricator.wikimedia.org/T398107 [13:17:22] EggRoll97: live on production :) [13:17:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to drbd [13:17:32] Yay, thanks TheresNoTime [13:17:41] PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:51] ^ expected, moritzm is working [13:18:01] !log installing rsyslog bugfix updates from Bookworm point release [13:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to drbd [13:18:36] RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 87.59 ms [13:18:43] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2278.codfw.wmnet [13:18:47] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2279.codfw.wmnet [13:19:58] PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:20:13] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org postfix mailing list - https://phabricator.wikimedia.org/T396062#10967738 (10Jgreen) [13:20:58] RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:21:23] <_joe_> !log depooling cp7006 for testing [13:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] (03PS3) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) [13:23:48] (03PS1) 10Vgutierrez: hiera: Switch eqsin 
to the new upload cert [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) [13:24:06] (03CR) 10Andrew Bogott: "pcc failed but only because my wildcard didn't work." [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott) [13:24:08] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: move per-host puppet7 def to role [puppet] - 10https://gerrit.wikimedia.org/r/1165587 (owner: 10Andrew Bogott) [13:24:15] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2279.codfw.wmnet [13:24:18] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2280.codfw.wmnet [13:24:26] (03CR) 10Andrew Bogott: [C:03+2] Include repo for ceph v16 'pacific' on cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1165878 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [13:24:26] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:24:42] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:49] TheresNoTime: are you done deploying? 
[13:25:01] zabe: yes sorry, forgot to say :) [13:25:35] no worries, just wanted to be sure [13:25:43] (03CR) 10Zabe: [C:03+2] group1: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165846 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [13:26:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to drbd [13:26:24] PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:40] RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 87.55 ms [13:26:53] (03PS1) 10David Martin: wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) [13:27:04] (03Merged) 10jenkins-bot: group1: Set categorylinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165846 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [13:27:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [13:27:32] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165846|group1: Set categorylinks to read new (T397912)]] [13:27:34] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [13:27:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967771 (10ops-monitoring-bot) Draining ganeti6002.drmrs.wmnet of running VMs [13:28:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [13:28:41] PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:29:42] RESOLVED: JobUnavailable: Reduced availability for job wikidough in 
ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:29:47] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2280.codfw.wmnet [13:29:51] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2281.codfw.wmnet [13:30:02] !log zabe@deploy1003 zabe: Backport for [[gerrit:1165846|group1: Set categorylinks to read new (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:30:06] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967775 (10MoritzMuehlenhoff) [13:30:32] !log failover Ganeti master in drmrs02 to ganeti6004 T382513 [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:35] T382513: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513 [13:30:40] RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:30:47] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442 (10Jgreen) 03NEW [13:30:54] !log zabe@deploy1003 zabe: Continuing with sync [13:31:23] (03CR) 10Ssingh: [C:03+1] hiera: Switch eqsin to the new upload cert [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:33:40] PROBLEM - 
ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:34:42] zabe: if you're going to deploy something else too, can you take https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1152406 with you? [13:34:45] it's a docs-only patch [13:35:19] (03CR) 10Ssingh: "I think this is ready to ship IMO -- how have you tested this out so far? I want to give it a test run and then happy to +1!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:35:26] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2281.codfw.wmnet [13:35:30] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2282.codfw.wmnet [13:35:33] sure [13:37:01] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch eqsin to the new upload cert [puppet] - 10https://gerrit.wikimedia.org/r/1165888 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:37:40] !log switch upload@eqsin to the new upload cert - T394484 [13:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:07] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [13:38:08] why is the "left: " counter increasing .. 
[13:38:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.702s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:39:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm) [13:39:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:39:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0.8663% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:39:32] hmm [13:39:34] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:44] that ain't good [13:40:05] (03Merged) 10jenkins-bot: [beta] docs: Document why weighed tags cannot be updated via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 
(https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm) [13:40:24] I aborted scap [13:40:30] Will try another sync-world [13:40:37] let us see how that goes [13:40:52] !log zabe@deploy1003 Started scap sync-world: T397912 [13:40:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2282.codfw.wmnet [13:41:01] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2283.codfw.wmnet [13:41:35] (03PS1) 10Daimona Eaytoy: Rename EventRegistration::$meetingAddress to $address for cache compat [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) [13:41:44] zabe@deploy1003: Failed to log message to wiki. Somebody should check the error logs. [13:41:44] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [13:42:22] "MediaWiki\Exception\UserNotLoggedIn: Please log in to be able to access this page or action." spike [13:42:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:15] claime: uhoh. i know what that is... [13:43:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:43:26] deployment-related? 
[13:43:45] !incidents [13:43:45] 6445 (UNACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [13:43:47] I guess so, based on backlog [13:43:49] !ack 6445 [13:43:49] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [13:43:57] actually that's not what's spiking [13:44:01] jynus: if UserNotLoggedIn is the cause, I'd say it's traffic related, but... I did not look at any logs [13:44:14] ok [13:44:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:44:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:44:51] jobqueue issues Oo [13:44:54] getting some errors, but it looks like people are already looking into it [13:44:56] !log zabe@deploy1003 sync-world aborted: T397912 (duration: 04m 03s) [13:45:02] (03PS1) 10Zabe: Revert "group1: Set categorylinks to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165897 [13:45:07] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: activate new plugins packages - bking@cumin1002 - T397227 [13:45:08] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: activate new plugins packages - bking@cumin1002 - T397227 [13:45:08] +query errors [13:45:24] It started showing up while deploying that one [13:45:34] a lot of `Error: 2006 MySQL server has gone away` it seems [13:45:37] Error: 2006 MySQL server has gone away [13:45:39] yeah [13:45:43] (03CR) 10Zabe: [V:03+2 C:03+2] Revert "group1: Set 
categorylinks to read new" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165897 (owner: 10Zabe) [13:45:50] on commons [13:46:07] I'm also getting this from api.php on enwiki: Original error: upstream connect error or disconnect/reset before headers. reset reason: connection failure [13:46:08] Although I do not really see the connection to jobqueue [13:46:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6002.wikimedia.org to plain [13:46:13] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2283.codfw.wmnet [13:46:16] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2284.codfw.wmnet [13:46:20] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new plugins packages - bking@cumin1002 - T397227 [13:46:22] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] [13:46:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6002.wikimedia.org to plain [13:47:09] we'll wait and see if the revert rollout calms things down [13:47:20] (situations like this make me ask "is there a faster way to sync something than `wait 10 mins`") [13:47:22] I think the convo should be moved to -sre, and anyone involved report what they know [13:47:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6002.drmrs.wmnet to plain [13:48:00] I think my patches caused some slow queries which overloaded commons db? 
[13:48:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:48:41] zabe: #wikimedia-sre please :) [13:48:42] PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:48:49] moritzm: ^ should I downtime these? [13:48:50] !log zabe@deploy1003 zabe: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:49:06] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:09] zabe: possible [13:49:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:49:23] 350 million queries per second on commons [13:49:27] that cannot be handled [13:49:42] !log zabe@deploy1003 zabe: Continuing with sync [13:49:46] jynus: -sre please [13:49:51] effie: otoh, https://wikitech.wikimedia.org/wiki/Backport_windows says deployment-related convo should happen in here... 
[13:50:25] jmm@cumin2002 changedisk (PID 4005626) is awaiting input [13:50:27] sukhe: each should resolve within 30seconds, so should be fine [13:50:30] ok :) [13:50:40] likewise for durum6002, which is incoming [13:50:40] RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:51:00] ok, but I do not really see how my patch could increase the number of queries, only how it could make them slow [13:51:14] urbanecm: yeah but it's impossible to follow with botnoise [13:51:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6002.drmrs.wmnet to plain [13:51:28] so sre debugging goes to -sre for the moment [13:51:31] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2284.codfw.wmnet [13:51:35] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2285.codfw.wmnet [13:51:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:51:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:51:54] wut? [13:51:56] <_joe_> that's my fault [13:51:57] !incidents [13:51:57] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [13:51:57] 6446 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:52:01] <_joe_> if it's magru [13:52:02] !ack 6446 [13:52:03] 6446 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:52:11] _joe_: how? 
[13:52:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:52:12] <_joe_> ah no if it's everything then it's not [13:52:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to plain [13:52:25] <_joe_> vgutierrez: I briefly repooled the server, for like 3 minutes [13:52:31] oh :D [13:52:40] 10SRE-SLO, 10Observability-Metrics, 10SRE Observability (FY2025/2026-Q1): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#10967929 (10lmata) [13:52:41] <_joe_> but tbh this seems to be related to the api issues [13:52:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [13:52:48] <_joe_> yeeep [13:52:53] !incidents [13:52:53] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [13:52:53] 6446 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:52:54] 6447 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [13:52:57] <_joe_> I assume the oncall people are looking into it [13:52:57] !ack 6447 [13:52:57] 6447 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [13:52:58] PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:53:13] <_joe_> ah that would be you vgutierrez, sorry, have fun [13:53:44] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:53:55] _joe_: I'm here to coordinate it, not solve it :D 
[13:54:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 6.212 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:12] claime: are you still waiting on the rollback? [13:54:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to plain [13:54:27] yeah [13:54:29] it's ongoing [13:54:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:34] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:38] 13:54:26 K8s deployment progress: 85% (ok: 1948; fail: 0; left: 321) \ [13:54:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:54:58] RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:55:00] !incidents [13:55:00] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [13:55:00] 6446 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:55:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to plain [13:55:00] 6447 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [13:55:01] 6448 (UNACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [13:55:03] !ack 6448 [13:55:03] 6448 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) 
[13:55:04] !ack 6448 [13:55:04] 6448 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [13:55:06] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to plain [13:55:59] (03PS1) 10Btullis: Bump resources and shared buffers config for postgresql-airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) [13:56:00] (03PS1) 10Btullis: Increase max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) [13:56:04] 5xx in ATS are already starting to decrease [13:56:25] is that a good thing? [13:56:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:56:47] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2285.codfw.wmnet [13:56:50] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2286.codfw.wmnet [13:56:58] Gommeh: yes [13:57:20] (03PS1) 10Elukey: profile::thanos::swift: rework machinetranslation account [puppet] - 10https://gerrit.wikimedia.org/r/1165901 (https://phabricator.wikimedia.org/T335491) [13:57:33] ATS is the cache layer that speaks to the applayer and it was recording an unexpected high number of 5xx from mw-api-ext-ro [13:57:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [13:57:51] (03Abandoned) 10Elukey: profile::thanos::swift: rework machinetranslation account [puppet] - 10https://gerrit.wikimedia.org/r/1165901 (https://phabricator.wikimedia.org/T335491) (owner: 10Elukey) [13:58:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:58:17] vgutierrez english please [13:58:23] new to this lol [13:59:51] FIRING: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:00:07] !incidents [14:00:08] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [14:00:08] 6448 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [14:00:08] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [14:00:08] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:00:08] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1400) [14:00:26] claime: 5xx back again to ~3.5k rps [14:00:43] everything on mw-api-ext-ro [14:00:48] <_joe_> being okta'd during incident response: priceless [14:00:57] Gommeh: ping me later after the incident ends :) [14:00:57] <_joe_> yes there isn't one pod that's ready in eqiad [14:01:16] (03CR) 10Elukey: "Today I discovered that `swift post -r` can grant to multiple users the read ACLs, and the new config seems more inline with what we need " [puppet] - 10https://gerrit.wikimedia.org/r/1164235 
(https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [14:01:26] <_joe_> which makes me think it's not just zabe's patch [14:01:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus6002.drmrs.wmnet to plain [14:01:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:01:46] <_joe_> but, to allow for systems to recover, should we ban all requests to the action api for commons? [14:02:00] !incidents [14:02:00] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [14:02:00] 6448 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [14:02:01] 6449 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:02:01] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [14:02:01] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:02:02] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2286.codfw.wmnet [14:02:04] !ack 6449 [14:02:04] 6449 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:02:06] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2287.codfw.wmnet [14:02:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus6002.drmrs.wmnet to plain [14:03:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:03:29] mysql seems to be on fire since 13:30 in
terms of rows read [14:03:35] (03PS2) 10Cory Massaro: wikifunctions: Enable batching in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156430 [14:03:39] (03Abandoned) 10Jforrester: wikifunctions: Enable batching in the orchestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156430 (owner: 10Cory Massaro) [14:04:34] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:51] FIRING: [8x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:04:57] !incidents [14:04:57] 6445 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [14:04:57] 6448 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [14:04:57] 6449 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:04:58] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [14:04:58] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:05:03] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:05:44] 10SRE-SLO, 10observability, 10SRE Observability (FY2025/2026-Q1): Add a banner to slo.wikimedia.org explaining rolling vs calendar views - https://phabricator.wikimedia.org/T398313#10967982 (10lmata) [14:06:19] 10SRE-SLO, 10observability, 10SRE Observability (FY2025/2026-Q1): Add links in the Pyrra rolling dashboards to point to their 
calendar ones in Grafana - https://phabricator.wikimedia.org/T398311#10967984 (10lmata) [14:06:24] (03CR) 10Urbanecm: "scap backport when no one is doing anything is an appropriate action in this case (since it is a labs-only change, it will amount to git p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152406 (https://phabricator.wikimedia.org/T395425) (owner: 10Urbanecm) [14:06:38] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti6002.drmrs.wmnet with reason: reimage [14:06:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:07:22] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2287.codfw.wmnet [14:07:26] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2288.codfw.wmnet [14:07:57] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:23] ok... 
5xx down to ~1k rps [14:08:28] FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:32] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS bookworm [14:08:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10967993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6002.drmrs.wmnet with OS bookworm [14:09:06] (03PS1) 10David Martin: wikifunctions: Upgrade evaluator from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) [14:09:34] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:10:18] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - 
https://phabricator.wikimedia.org/T349619#10967996 (10MoritzMuehlenhoff) [14:10:47] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:11:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:47] (03PS5) 10Klausman: hiera/thanos-swift: Fix MinT user [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) [14:11:56] (03CR) 10Klausman: hiera/thanos-swift: Fix MinT user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [14:12:13] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:12:41] ok.. 
ATSBackendErrorsHigh should recover any minute now [14:12:42] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2288.codfw.wmnet [14:12:43] (03CR) 10FNegri: [C:03+1] "Ok I'll merge it as soon I get out of a meeting :)" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [14:12:46] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2289.codfw.wmnet [14:13:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:14:00] !log zabe@deploy1003 Started scap sync-world: retry revert [14:14:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 1.911% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:14:21] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new plugins packages - bking@cumin1002 - T397227 [14:14:23] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [14:14:51] RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:17:57] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2289.codfw.wmnet [14:18:01] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2290.codfw.wmnet [14:18:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.162s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:18:28] !log zabe@deploy1003 Finished scap sync-world: retry revert (duration: 04m 27s) [14:18:28] RESOLVED: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:45] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:20:47] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:22:13] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:18] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2290.codfw.wmnet [14:23:21] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host 
wikikube-worker2291.codfw.wmnet [14:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 22.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:25:03] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:26:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti6002.drmrs.wmnet with reason: host reimage [14:28:08] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10968092 (10MoritzMuehlenhoff) [14:28:32] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:49] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2291.codfw.wmnet [14:28:52] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2292.codfw.wmnet [14:29:10] 06SRE: HTTP 503 errors trying to reach Wikipedia - https://phabricator.wikimedia.org/T398448#10968098 (10Aklapper) [14:29:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 3.676% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:34] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:30:08] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1400) [14:30:08] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1430) [14:30:45] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:31:17] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [14:31:27] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet [14:31:55] !log oblivian@deploy1003 Started scap sync-world: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] [14:32:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti6002.drmrs.wmnet with reason: host reimage [14:34:08] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2292.codfw.wmnet [14:34:11] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2293.codfw.wmnet [14:34:15] !log oblivian@deploy1003 zabe, oblivian: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:34:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 5.903% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:35:01] !log oblivian@deploy1003 zabe, oblivian: Continuing with sync [14:35:10] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10968106 (10MoritzMuehlenhoff) [14:35:14] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:35:45] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:36:00] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:36:02] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:36:43] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:36:52] ^ vgutierrez [14:36:56] probably this [14:36:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:05] !incidents [14:37:06] 6450 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) 
[14:37:06] 6448 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [14:37:06] 6445 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [14:37:07] 6449 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:37:07] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [14:37:07] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:37:12] !ack 6450 [14:37:13] 6450 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [14:37:33] <_joe_> vgutierrez: please don't ack alerts we're not managing rn. that's unrelated to the current issue [14:37:53] taking a look at that at the moment [14:38:17] <_joe_> I'd prefer your eyeballs on the main issue [14:38:19] <_joe_> :) [14:38:28] FIRING: [2x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:32] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:33] I'll take over looking at thanos [14:38:37] godog: thx [14:38:39] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [14:38:40] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - bking@cumin1002 - T397227 [14:38:42] T397227: Build and deploy OpenSearch plugins package for 
updated regex search - https://phabricator.wikimedia.org/T397227 [14:38:42] you got it vgutierrez [14:38:43] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:38:45] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@1bb179b]: bump section topics to v1.6.0 [14:39:22] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@1bb179b]: bump section topics to v1.6.0 (duration: 00m 47s) [14:39:34] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:38] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2293.codfw.wmnet [14:39:42] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2294.codfw.wmnet [14:39:53] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10968153 (10MoritzMuehlenhoff) [14:40:22] !log oblivian@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165897|Revert "group1: Set categorylinks to read new"]] (duration: 08m 26s) [14:41:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:32] !log bounce thanos-store on titan1002 [14:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:53] (03CR) 10Elukey: [C:03+2] pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 
(https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:44:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2294.codfw.wmnet [14:45:02] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2295.codfw.wmnet [14:45:03] (03CR) 10Elukey: [C:03+2] profile::pyrra::filesystem::slo: fix WDQS SLI [puppet] - 10https://gerrit.wikimedia.org/r/1165521 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [14:45:10] (03CR) 10Elukey: [C:03+2] pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [14:45:19] (03PS2) 10Elukey: pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) [14:45:38] (03PS4) 10Elukey: pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) [14:45:46] (03PS2) 10Elukey: pyrra: add tonecheck Pyrra config [puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) [14:47:03] (03PS1) 10JMeybohm: sre.k8s.pool-depool-cluster: Exclude w[d,c]ws from repooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1165908 (https://phabricator.wikimedia.org/T341984) [14:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 8.186% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:47:36] !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=eqiad [14:48:04] 
(03PS1) 10Elukey: profile::thanos::swift: rename machinetranslation account [labs/private] - 10https://gerrit.wikimedia.org/r/1165909 [14:48:28] (03CR) 10Elukey: "filed also https://gerrit.wikimedia.org/r/c/labs/private/+/1165909" [puppet] - 10https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [14:48:38] (03CR) 10Elukey: [C:03+2] pyrra: rename "requests" to "availability" in the Istio SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1165525 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [14:49:48] (03CR) 10Elukey: [C:03+2] pyrra: add experimental success ratio template for istio [puppet] - 10https://gerrit.wikimedia.org/r/1165539 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [14:50:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:50:29] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2295.codfw.wmnet [14:50:32] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2296.codfw.wmnet [14:50:45] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:51:00] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle 
PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:52:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6002.drmrs.wmnet with OS bookworm [14:52:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10968198 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6002.drmrs.wmnet with OS bookworm completed: - ganeti6002 (**PASS*... [14:52:35] 06SRE: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10968199 (10Aklapper) [14:52:38] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:52:54] (03PS1) 10Ahmon Dancy: data.yaml: Allow tailing of spiderpig jobrunner and apiserver journals [puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) [14:53:21] (03PS2) 10Ahmon Dancy: data.yaml: Allow tailing of spiderpig jobrunner and apiserver journals [puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) [14:53:28] RESOLVED: [2x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:17] (03CR) 10Aqu: [C:03+1] "We tweaked it on analytics-test. My experience was globally positive with a faster enqueuing of tasks for dagruns with many tasks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: 10Btullis) [14:54:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [14:54:27] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:54:49] (03CR) 10Aqu: [C:03+1] Bump resources and shared buffers config for postgresql-airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) (owner: 10Btullis) [14:55:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1165839 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [14:55:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 837.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:55:45] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:55:50] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2296.codfw.wmnet [14:55:53] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2297.codfw.wmnet [14:56:24] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2014 [14:57:02] (03CR) 10David Caro: [C:03+1] 
"LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [14:57:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2014 [14:57:41] (03PS1) 10Elukey: pyrra: rename class attribute for the citoid SLO [puppet] - 10https://gerrit.wikimedia.org/r/1165913 [15:00:09] (03CR) 10Majavah: [C:03+2] natlog: Use a separate journald namespace with no storage [puppet] - 10https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [15:00:30] (03PS2) 10Elukey: pyrra: rename class attribute for the citoid SLO [puppet] - 10https://gerrit.wikimedia.org/r/1165913 [15:00:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:01:21] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2297.codfw.wmnet [15:01:22] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6126/console" [puppet] - 10https://gerrit.wikimedia.org/r/1165913 (owner: 10Elukey) [15:01:24] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2298.codfw.wmnet [15:02:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [15:02:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti6002.drmrs.wmnet to cluster drmrs02 and group B13 [15:03:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10968238 (10MoritzMuehlenhoff) [15:03:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti6002.drmrs.wmnet to cluster drmrs02 and group B13 [15:04:37] (03CR) 10Muehlenhoff: 
[C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1165606 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [15:04:40] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6127/console" [puppet] - 10https://gerrit.wikimedia.org/r/1165913 (owner: 10Elukey) [15:05:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow6001.drmrs.wmnet to drbd [15:06:26] !log dancy@deploy1003 Installing scap version "4.185.0" for 2 host(s) [15:06:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2298.codfw.wmnet [15:06:40] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2299.codfw.wmnet [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:59] !log jiji@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=eqiad [15:08:14] !log dancy@deploy1003 Installation of scap version "4.185.0" completed for 2 hosts [15:11:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:25] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:11:51] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2299.codfw.wmnet [15:11:55] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2300.codfw.wmnet [15:12:28] (03CR) 10Joal: [C:03+1] Increase 
max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: 10Btullis) [15:13:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:14:05] (03CR) 10Joal: [C:03+1] "Not knowing defaults I don't know by how much we grow the available resources, but I think growing them is positive! +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) (owner: 10Btullis) [15:14:34] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host pc2014 [15:14:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2014 [15:14:44] (03PS1) 10Majavah: hieradata: Enable NAT logging on both codfw1dev cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1165919 (https://phabricator.wikimedia.org/T273734) [15:15:09] !log repool cp7006 [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow6001.drmrs.wmnet to drbd [15:15:45] PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:12] (03CR) 10Majavah: [C:03+2] hieradata: Enable NAT logging on both codfw1dev cloudgws [puppet] - 10https://gerrit.wikimedia.org/r/1165919 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [15:16:37] RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 87.53 ms [15:16:42] FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2300.codfw.wmnet [15:17:20] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2301.codfw.wmnet [15:18:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [15:20:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir6002.drmrs.wmnet to drbd [15:21:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:22:08] (03PS1) 10Majavah: natlog: Add explicit dependency to file_line [puppet] - 10https://gerrit.wikimedia.org/r/1165921 (https://phabricator.wikimedia.org/T273734) [15:22:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2301.codfw.wmnet [15:22:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2302.codfw.wmnet [15:26:11] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: rename class attribute for the citoid SLO [puppet] - 10https://gerrit.wikimedia.org/r/1165913 (owner: 10Elukey) [15:26:57] FIRING: [5x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:44] (03CR) 
10Paladox: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [15:28:04] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2302.codfw.wmnet [15:28:07] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2303.codfw.wmnet [15:30:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir6002.drmrs.wmnet to drbd [15:30:47] PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:03] RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 87.48 ms [15:31:57] RESOLVED: [5x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2303.codfw.wmnet [15:33:27] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2304.codfw.wmnet [15:38:42] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2304.codfw.wmnet [15:38:46] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2305.codfw.wmnet [15:41:09] jouncebot: nowandnext [15:41:09] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [15:41:09] In 1 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1700) [15:42:42] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2305.codfw.wmnet [15:42:46] !log cgoubert@cumin1003 START 
- Cookbook sre.hosts.reboot-single for host wikikube-worker2306.codfw.wmnet [15:44:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) (owner: 10Daimona Eaytoy) [15:45:39] (03CR) 10Cmelo: [C:03+1] Rename EventRegistration::$meetingAddress to $address for cache compat [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) (owner: 10Daimona Eaytoy) [15:45:55] (03CR) 10FNegri: [C:03+2] maintain-views: Use linktarget and collation in categorylinks view [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [15:46:17] (03Merged) 10jenkins-bot: Rename EventRegistration::$meetingAddress to $address for cache compat [extensions/CampaignEvents] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1165894 (https://phabricator.wikimedia.org/T398413) (owner: 10Daimona Eaytoy) [15:46:46] !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1165894|Rename EventRegistration::$meetingAddress to $address for cache compat (T398413)]] [15:46:48] T398413: TypeError: Cannot assign string to property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$meetingAddress of type ?MediaWiki\Extension\CampaignEvents\Address\Address - https://phabricator.wikimedia.org/T398413 [15:47:58] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs4010.ulsfo.wmnet with reason: katran migration [15:48:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:48:10] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2306.codfw.wmnet [15:48:14] !log 
cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2307.codfw.wmnet [15:48:36] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs4010 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1165830 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:49:06] !log jnuche@deploy1003 jnuche, daimona: Backport for [[gerrit:1165894|Rename EventRegistration::$meetingAddress to $address for cache compat (T398413)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:49:56] !log jnuche@deploy1003 jnuche, daimona: Continuing with sync [15:51:42] (03PS2) 10Dzahn: remove legacy miscweb VM service names [dns] - 10https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080) [15:52:04] (03CR) 10Dzahn: [C:03+1] "the decom cookbook has been executed on these" [dns] - 10https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [15:53:47] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2307.codfw.wmnet [15:53:50] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2308.codfw.wmnet [15:55:37] !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165894|Rename EventRegistration::$meetingAddress to $address for cache compat (T398413)]] (duration: 08m 51s) [15:55:39] T398413: TypeError: Cannot assign string to property MediaWiki\Extension\CampaignEvents\Event\EventRegistration::$meetingAddress of type ?MediaWiki\Extension\CampaignEvents\Address\Address - https://phabricator.wikimedia.org/T398413 [15:55:44] (03Abandoned) 10Elukey: aux/dse: remove the usage of sha256 digest image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163713 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [15:55:57] (03Abandoned) 10Elukey: profile::thanos::swift: rename machinetranslation account 
[labs/private] - 10https://gerrit.wikimedia.org/r/1165909 (owner: 10Elukey) [15:56:08] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2308.codfw.wmnet [15:56:11] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2309.codfw.wmnet [15:56:37] !log switch lvs4010 to katran - 10.128.0.11 [15:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:45] wrong copy&pasta :) [15:59:05] (03PS1) 10Vgutierrez: hiera: Consolidate ulsfo liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) [15:59:36] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [16:01:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2309.codfw.wmnet [16:01:26] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2310.codfw.wmnet [16:01:50] (03CR) 10Andrea Denisse: [C:03+2] centrallog: Log with standard and custom template [puppet] - 10https://gerrit.wikimedia.org/r/1163901 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [16:02:41] (03PS28) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [16:05:45] (03CR) 10Ssingh: [C:03+1] hiera: Consolidate ulsfo liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [16:06:33] (03CR) 10Btullis: [C:03+2] Bump resources and shared buffers config for postgresql-airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165899 (https://phabricator.wikimedia.org/T398421) (owner: 10Btullis) [16:06:37] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
wikikube-worker2310.codfw.wmnet [16:06:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2311.codfw.wmnet [16:08:09] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [16:08:21] (03CR) 10Majavah: [C:03+1] Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [16:08:35] (03CR) 10Vgutierrez: [C:03+2] hiera: Consolidate ulsfo liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1165928 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [16:10:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [16:10:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [16:12:02] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2311.codfw.wmnet [16:12:06] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2312.codfw.wmnet [16:13:14] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [16:13:54] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.pool-depool-cluster: Exclude w[d,c]ws from repooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1165908 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:17:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2312.codfw.wmnet [16:17:27] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2313.codfw.wmnet [16:21:41] (03PS1) 10Clément Goubert: admin::data: Add access for antonkokhwmde [puppet] - 10https://gerrit.wikimedia.org/r/1165936 
(https://phabricator.wikimedia.org/T395917) [16:21:41] (03CR) 10Clément Goubert: "Verified out of band via slack" [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) (owner: 10Clément Goubert) [16:22:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10968656 (10Clement_Goubert) [16:22:43] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2313.codfw.wmnet [16:22:47] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2314.codfw.wmnet [16:27:28] (03CR) 10Andrew Bogott: [C:03+2] Openstack web proxy: allow 'proxyadmin' users to modify proxies [puppet] - 10https://gerrit.wikimedia.org/r/1165154 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [16:27:30] (03CR) 10Andrew Bogott: [C:03+2] Openstack web proxy: allow 'puppetencadmin' users to modify per-vm puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165155 (https://phabricator.wikimedia.org/T273150) (owner: 10Andrew Bogott) [16:28:16] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2314.codfw.wmnet [16:28:20] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2315.codfw.wmnet [16:29:20] (03PS2) 10Clément Goubert: admin::data: Add access for antonkokhwmde [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) [16:30:22] (03CR) 10Ssingh: [C:03+1] "Verified uid and access requirement." 
[puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) (owner: 10Clément Goubert) [16:30:38] (03CR) 10Clément Goubert: [C:03+2] admin::data: Add access for antonkokhwmde [puppet] - 10https://gerrit.wikimedia.org/r/1165936 (https://phabricator.wikimedia.org/T395917) (owner: 10Clément Goubert) [16:33:32] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2315.codfw.wmnet [16:33:35] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2316.codfw.wmnet [16:34:03] 06SRE, 13Patch-For-Review: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10968700 (10Clement_Goubert) For the record, this is this incident https://www.wikimediastatus.net/incidents/57jsxtn7hlvf [16:34:43] (03CR) 10Btullis: [C:03+2] Increase max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: 10Btullis) [16:35:10] 06SRE, 06Infrastructure-Foundations: Netbox: PupeptDB Import - ignore 'vxlan' and 'openvswitch' interfaces without IPs - https://phabricator.wikimedia.org/T398464 (10cmooney) 03NEW p:05Triage→03Low [16:36:21] (03Merged) 10jenkins-bot: Increase max_tis_per_query for airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165900 (https://phabricator.wikimedia.org/T396686) (owner: 10Btullis) [16:37:22] 06SRE, 13Patch-For-Review: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10968724 (10Clement_Goubert) p:05Triage→03Medium Incident is resolved, setting medium priority for follow-up. 
[16:39:03] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2316.codfw.wmnet [16:39:07] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2317.codfw.wmnet [16:39:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [16:40:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [16:40:42] (03PS1) 10Andrew Bogott: Prepare cloudcephmon nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165940 (https://phabricator.wikimedia.org/T306820) [16:40:44] (03PS1) 10Andrew Bogott: Prepare cloudcephosd nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165941 (https://phabricator.wikimedia.org/T306820) [16:43:21] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2317.codfw.wmnet [16:43:24] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2318.codfw.wmnet [16:43:30] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:44:03] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165941 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [16:44:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165940 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [16:44:26] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:45:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - 
https://phabricator.wikimedia.org/T395917#10968745 (10Clement_Goubert) Shell access and kerberos principal created, i... [16:45:57] (03PS2) 10Volans: kubernetes: improve naming [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) [16:46:20] (03CR) 10Volans: "following today's IRC discussion this is the final proposal with proper naming." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1165847 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:47:46] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephmon nodes in codfw for ceph v16 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1165940 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [16:47:52] !log bking@cumin1002 restarting cirrrussearch codfw T397227 [16:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:55] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [16:48:28] FIRING: [6x] SystemdUnitFailed: opensearch_1@production-search-omega-codfw.service on cirrussearch2071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:36] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2318.codfw.wmnet [16:48:39] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2319.codfw.wmnet [16:48:43] PROBLEM - OpenSearch health check for shards on 9400 on cirrussearch2099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f835f0dd1c0: Failed to establish a new 
connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [16:50:43] RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2099 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: green, timed_out: False, number_of_nodes: 27, number_of_data_nodes: 27, discovered_master: True, active_primary_shards: 1710, active_shards: 5125, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [16:53:40] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2319.codfw.wmnet [16:53:43] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2320.codfw.wmnet [16:58:55] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2320.codfw.wmnet [16:58:59] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2321.codfw.wmnet [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1700) [17:04:31] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2321.codfw.wmnet [17:04:35] !log cgoubert@cumin1003 START - Cookbook
sre.hosts.reboot-single for host wikikube-worker2322.codfw.wmnet [17:10:03] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2322.codfw.wmnet [17:10:06] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2323.codfw.wmnet [17:12:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:34] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2323.codfw.wmnet [17:15:37] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2324.codfw.wmnet [17:18:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:21:03] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2324.codfw.wmnet [17:21:06] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2325.codfw.wmnet [17:23:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:26:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host wikikube-worker2325.codfw.wmnet [17:26:26] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2326.codfw.wmnet [17:27:43] (03CR) 10Scott French: [C:03+2] aptrepo: add pcre2-php83-bullseye to Update list [puppet] - 10https://gerrit.wikimedia.org/r/1165606 (https://phabricator.wikimedia.org/T398245) (owner: 10Scott French) [17:28:07] (03CR) 10Dzahn: [C:03+2] remove legacy miscweb VM service names [dns] - 10https://gerrit.wikimedia.org/r/1165616 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:28:47] !log dzahn@dns1004 START - running authdns-update [17:29:57] !log dzahn@dns1004 END - running authdns-update [17:31:42] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2326.codfw.wmnet [17:31:46] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2327.codfw.wmnet [17:32:49] (03PS1) 10Dzahn: miscweb: delete role and miscweb::httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/1165955 (https://phabricator.wikimedia.org/T397080) [17:34:03] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2327.codfw.wmnet [17:34:06] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2328.codfw.wmnet [17:36:28] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2328.codfw.wmnet [17:36:31] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2329.codfw.wmnet [17:40:25] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [17:41:44] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2329.codfw.wmnet [17:41:48] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for 
host wikikube-worker2330.codfw.wmnet [17:42:25] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.011 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [17:47:00] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2330.codfw.wmnet [17:52:41] (03CR) 10AOkoth: [C:03+1] miscweb: delete role and miscweb::httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/1165955 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:53:15] (03CR) 10Dzahn: [C:03+2] miscweb: delete role and miscweb::httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/1165955 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:53:45] !log reprepro update component/php83 with pcre2 10.42-1~wmf11+1 - T398245 [17:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:48] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245 [18:00:05] jnuche and jeena: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T1800). 
[18:05:05] (CR) Ssingh: "Output for review:" [puppet] - https://gerrit.wikimedia.org/r/1164296 (https://phabricator.wikimedia.org/T374619) (owner: Ssingh)
[18:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:11:25] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:11:27] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:12:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:18:47] SRE, MW-1.45-notes (1.45.0-wmf.9; 2025-07-08), Wikimedia-Incident: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448#10969186 (Aklapper)
[18:20:29] (PS1) Samtar: labstore: Add dumpstorrents project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/1165962 (https://phabricator.wikimedia.org/T398477)
[18:29:25] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:29:27] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:29:41] (CR) Andrew Bogott: [C:+2] Prepare cloudcephosd nodes in codfw for ceph v16 'pacific' [puppet] - https://gerrit.wikimedia.org/r/1165941 (https://phabricator.wikimedia.org/T306820) (owner: Andrew Bogott)
[18:30:27] (PS2) Majavah: cloudnfs: Add dumpstorrents project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/1165962 (https://phabricator.wikimedia.org/T398477) (owner: Samtar)
[18:31:11] (CR) Majavah: [C:+2] cloudnfs: Add dumpstorrents project to dumps mounts [puppet] - https://gerrit.wikimedia.org/r/1165962 (https://phabricator.wikimedia.org/T398477) (owner: Samtar)
[18:32:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:42:15] !log reprepro include php8.3_8.3.22-1+wmf11u1 in component/php83 - T398245
[18:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:18] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[19:10:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:11:14] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#10969492 (Jhancock.wm)
[19:11:23] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:18:09] PROBLEM - Host ssw1-d8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:27] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:27] PROBLEM - Host lsw1-d8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:09] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:15] PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:28:28] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[19:29:17] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:38:28] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:39:17] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:43:47] (CR) Hashar: gerrit: config replicas for rename-project plugin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[19:47:35] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2000).
[20:00:05] No Gerrit patches in the queue for this window AFAICS.
[20:04:13] is anyone deploying?
[20:04:23] (PS2) Krinkle: beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318)
[20:04:28] (PS4) Krinkle: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318)
[20:04:32] (PS3) Krinkle: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318)
[20:06:32] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[20:06:52] !log krinkle@deploy1003:/srv/mediawiki$ git remote rm gerrit -- Fix `jforrester@gerrit.wikimedia.org: Permission denied (publickey).` There were two remotes: $ git remote -v gerrit ssh://jforrester@gerrit origin ssh://gerrit.wikimedia.org:29418
[20:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:37] (CR) Krinkle: [C:+2] beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:09:29] (Merged) jenkins-bot: beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:10:12] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[20:11:18] (PS5) Krinkle: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318)
[20:11:46] (CR) Krinkle: [C:+2] beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:11:50] (PS4) Krinkle: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318)
[20:12:16] (CR) Krinkle: [C:+2] multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:12:35] (Merged) jenkins-bot: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:13:14] (Merged) jenkins-bot: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:28:36] !log krinkle@deploy1003 Started scap sync-world: Beta patches Iff58893f, I62b31535, I228d7766a57
[20:29:04] !log reprepro include php-defaults_94+wmf11u1 in component/php83 - T398245
[20:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:06] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[20:29:24] (CR) FNegri: [C:+2] "After merging I realized this hasn't been +1d from the Data Engineering team, and they are the owners of maintain-views.yaml. [0]" [puppet] - https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: Zabe)
[20:30:40] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bookworm
[20:31:42] !log krinkle@deploy1003 Finished scap sync-world: Beta patches Iff58893f, I62b31535, I228d7766a57 (duration: 03m 06s)
[20:32:30] (PS1) Krinkle: missing.php: Support beta suffix for auth.wikimedia error page [mediawiki-config] - https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318)
[20:33:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:52] (CR) Krinkle: missing.php: Support beta suffix for auth.wikimedia error page (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:34:01] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:34:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:34:18] !log reprepro include dh-php_5.5+wmf11u1 in component/php83 - T398245
[20:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:21] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[20:34:51] (Merged) jenkins-bot: missing.php: Support beta suffix for auth.wikimedia error page [mediawiki-config] - https://gerrit.wikimedia.org/r/1165983 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:35:16] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1165983|missing.php: Support beta suffix for auth.wikimedia error page (T289318)]]
[20:35:18] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[20:36:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 6.474 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:36:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.566 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:25] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1165983|missing.php: Support beta suffix for auth.wikimedia error page (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:40:45] (CR) D3r1ck01: [C:+1] "Made: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1165984" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154108 (owner: Krinkle)
[20:42:28] (PS1) Cwhite: logstash: pass through normalized arrays from filter-on-template v1 [puppet] - https://gerrit.wikimedia.org/r/1165988 (https://phabricator.wikimedia.org/T234565)
[20:46:05] (PS1) Krinkle: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318)
[20:47:22] (PS2) Krinkle: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318)
[20:49:22] (CR) Bking: [C:+1] hiera,cirrus: Enable IPIP on search*@codfw services [puppet] - https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[20:49:58] (CR) Bking: [C:+1] hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[20:50:39] ops-eqiad, SRE, DC-Ops: Power Supply - PS Redundancy - issue on an-worker1206:9290 - https://phabricator.wikimedia.org/T397978#10969894 (Jclark-ctr) Open→Resolved Received replacement psu server has dual power
[20:51:06] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[20:51:38] (CR) Cwhite: [C:+2] logstash: pass through normalized arrays from filter-on-template v1 [puppet] - https://gerrit.wikimedia.org/r/1165988 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[20:53:04] (PS2) Andrew Bogott: Openstack common/servicetoken.erb: remove a misleading comment [puppet] - https://gerrit.wikimedia.org/r/1143611
[20:55:04] (PS2) Krinkle: wmf-config: Fix filename typo in InitialiseSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1154108
[20:59:24] !log krinkle@deploy1003 krinkle: Continuing with sync
[21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2100)
[21:04:25] (PS2) Jforrester: wikifunctions: Upgrade evaluators from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) (owner: David Martin)
[21:04:32] (PS2) Jforrester: wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) (owner: David Martin)
[21:05:10] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165983|missing.php: Support beta suffix for auth.wikimedia error page (T289318)]] (duration: 29m 54s)
[21:05:14] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[21:08:54] (CR) David Martin: [C:+2] wikifunctions: Upgrade evaluators from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) (owner: David Martin)
[21:08:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:10:45] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-06-23-151702 to 2025-07-02-122843 [deployment-charts] - https://gerrit.wikimedia.org/r/1165903 (https://phabricator.wikimedia.org/T391208) (owner: David Martin)
[21:11:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:30] !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:12:37] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:12:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:13:20] !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:14:05] !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:15:58] !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:16:54] !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:17:42] ops-codfw, DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2320:9290 - https://phabricator.wikimedia.org/T398514 (phaultfinder) NEW
[21:18:00] (CR) David Martin: [C:+2] wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) (owner: David Martin)
[21:19:35] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-06-24-204920 to 2025-07-02-123323 [deployment-charts] - https://gerrit.wikimedia.org/r/1165890 (https://phabricator.wikimedia.org/T391208) (owner: David Martin)
[21:20:34] !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:20:58] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:22:22] !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:22:58] !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:23:18] !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:23:47] !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:24:29] (PS1) Zabe: ApiQueryCategoryMembers: Use correct index for categorylinks [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1165996 (https://phabricator.wikimedia.org/T385890)
[21:32:54] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1161661 (owner: PipelineBot)
[21:32:57] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1164168 (owner: PipelineBot)
[21:33:02] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1165065 (owner: PipelineBot)
[21:33:05] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1165150 (owner: PipelineBot)
[21:33:22] (Abandoned) Jforrester: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1162238 (owner: PipelineBot)
[21:33:41] (Abandoned) Jforrester: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1165496 (owner: PipelineBot)
[21:35:52] (PS1) Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999
[21:36:30] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 302661744 and 25 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:36:43] (CR) CI reject: [V:-1] beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999 (owner: Krinkle)
[21:37:30] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 59624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:42:17] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154108 (owner: Krinkle)
[21:42:48] (PS2) Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999
[21:43:10] (Merged) jenkins-bot: wmf-config: Fix filename typo in InitialiseSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1154108 (owner: Krinkle)
[21:49:36] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:55:51] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:59:11] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2200)
[22:02:21] jhathaway@cumin2002 provision (PID 4172609) is awaiting input
[22:02:30] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[22:04:25] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10970263 (VRiley-WMF) We have received the Seed Server for this unit. Would we like to use a new/different name but set it up in the same location?
[22:07:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:08:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:08:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:11:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:43] jouncebot: nowandnext
[22:12:43] For the next 0 hour(s) and 47 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250702T2200)
[22:12:43] In 7 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600)
[22:12:43] In 7 hour(s) and 47 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250703T0600)
[22:12:46] (PS5) Cwhite: logstash: convert ECS numeric fields from strings [puppet] - https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:12:50] (CR) Zabe: [C:+2] ApiQueryCategoryMembers: Use correct index for categorylinks [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1165996 (https://phabricator.wikimedia.org/T385890) (owner: Zabe)
[22:12:51] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:14:22] (PS6) Cwhite: logstash: convert ECS numeric fields from strings [puppet] - https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:14:59] (CR) Dzahn: gerrit: config replicas for rename-project plugin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: Hashar)
[22:16:04] (CR) ArielGlenn: [C:+1] "Seems fine to me. If you can't find someone else to merge it, I'll be happy to." [puppet] - https://gerrit.wikimedia.org/r/1165526 (owner: D3r1ck01)
[22:16:22] (CR) CI reject: [V:-1] logstash: convert ECS numeric fields from strings [puppet] - https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[22:16:48] (Merged) jenkins-bot: ApiQueryCategoryMembers: Use correct index for categorylinks [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1165996 (https://phabricator.wikimedia.org/T385890) (owner: Zabe)
[22:17:33] (PS3) Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318)
[22:17:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1165996|ApiQueryCategoryMembers: Use correct index for categorylinks (T385890 T398448)]]
[22:17:51] T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[22:17:52] T398448: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448
[22:18:40] (PS7) Cwhite: logstash: convert ECS numeric fields from strings [puppet] - https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:19:16] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:19:30] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:19:55] !log zabe@deploy1003 zabe: Backport for [[gerrit:1165996|ApiQueryCategoryMembers: Use correct index for categorylinks (T385890 T398448)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:19:59] (PS4) Krinkle: beta: Change Beta wikidata canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1165999 (https://phabricator.wikimedia.org/T289318)
[22:20:37] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm
[22:20:43] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970299 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm
[22:20:55] (CR) Krinkle: [C:+2] beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[22:21:11] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:21:35] !log zabe@deploy1003 zabe: Continuing with sync
[22:21:37] (CR) Krinkle: beta: Include allowance for wmcloud.org in wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1165989 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[22:25:30] (PS8) Cwhite: logstash: convert ECS numeric fields from strings [puppet] - https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565)
[22:26:59] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165996|ApiQueryCategoryMembers: Use correct index for categorylinks (T385890 T398448)]] (duration: 09m 12s)
[22:27:03] T385890: Add support for read new for categorylinks migration - https://phabricator.wikimedia.org/T385890
[22:27:04] T398448: HTTP 503 errors trying to reach Wikipedia: 2025-07-02 s4 overload - https://phabricator.wikimedia.org/T398448
[22:27:23] Krinkle: you can merge your patch now
[22:27:33] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970307 (VRiley-WMF)
[22:29:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:29:51] zabe: it's ok, I'll roll it out later. I've got a few errands to run first.
[22:30:00] thx for the ping
[22:30:03] alright
[22:33:41] (PS1) BryanDavis: zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936)
[22:36:28] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 85041MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[22:37:18] ryankemper: see the above page. wdqs2009 is acting up. I see a blazegraph restart in SAl
[22:37:21] L
[22:37:30] is that the recommended course of action?
[22:37:43] !incidents
[22:37:43] 6451 (UNACKED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[22:37:43] 6450 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[22:37:44] 6448 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[22:37:44] 6445 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[22:37:44] 6449 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[22:37:44] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule)
[22:37:44] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[22:37:50] !ack 6451
[22:37:51] 6451 (ACKED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[22:38:02] sukhe: yes it is
[22:38:04] inflatador: see above as well
[22:38:09] let me look into making service not page
[22:38:31] ryankemper: ok thanks. can you take care of it please? not really near a computer rn
[22:38:39] I acked the page
[22:38:50] yeah I've got it
[22:38:56] ah, I was just flagging in -sre, whoops
[22:39:17] ryankemper: <3
[22:39:26] thanks swfrench-wmf
[22:39:37] (PS1) Cwhite: add docs for string_to_numeric_conversion_failure [software/ecs] - https://gerrit.wikimedia.org/r/1166008 (https://phabricator.wikimedia.org/T234565)
[22:40:00] !log [WDQS] Restart wdqs-blazegraph on wdqs2009
[22:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:49] So is this alert a generic one that will apply regardless of `page: false` being set in service.yaml?
[22:41:10] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970332 (VRiley-WMF) While trying to image these servers, it seems to lock up during the reboot with just a generic time out reason. Verified that the s...
[22:41:28] Because ideally i don't want this host paging. It's a single wdqs full graph host that will be kept online for next few months for legacy reasons but we don't make any guarantees to users as to its availability
[22:43:28] (CR) Cwhite: [C:+2] logstash: convert ECS numeric fields from strings [puppet] - https://gerrit.wikimedia.org/r/1164522 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[22:44:23] (PS1) JHathaway: preseed: fix match for sretest [puppet] - https://gerrit.wikimedia.org/r/1166010
[22:44:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:46:03] ryankemper: yeah pretty much. this is alerting because ATS is not happy with the backend
[22:47:25] (CR) JHathaway: [C:+2] preseed: fix match for sretest [puppet] - https://gerrit.wikimedia.org/r/1166010 (owner: JHathaway)
[22:47:37] https://github.com/wikimedia/operations-alerts/blob/master/team-sre/cdn.yaml
[22:49:37] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:49:43] !log dancy@deploy1003 Installing scap version "4.186.0" for 2 host(s)
[22:50:40] (PS1) Arlolra: ExtensionDistributor: Mark 1.44 as stable; remove 1.42 as EOL [mediawiki-config] - https://gerrit.wikimedia.org/r/1166012 (https://phabricator.wikimedia.org/T390798)
[22:51:31] !log dancy@deploy1003 Installation of scap version "4.186.0" completed for 2 hosts
[22:51:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:01:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2009:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:02:08] !log [WDQS] `ryankemper@wdqs2009:~$ sudo systemctl restart prometheus-blazegraph-exporter-wdqs-blazegraph.service`
[23:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:29] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[23:05:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:05:52] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:07:28] !incidents
[23:07:28] 6452 (UNACKED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[23:07:29] 6451 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[23:07:29] 6450 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[23:07:29] 6448 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[23:07:29] 6445 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[23:07:30] 6449 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[23:07:30] 6447 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule)
[23:07:30] 6446 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[23:07:36] !ack 6452
[23:07:37] 6452 (ACKED) ATSBackendErrorsHigh cache_text sre (wdqs2009.codfw.wmnet eqsin)
[23:07:39] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[23:08:17] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:08:18] ryankemper: thanks for responding during the previous instance of this. does the service need another restart, or is there some other mitigation needed?
[23:09:52] also yeah, as s.ukhe pointed out, this is decoupled from the `page: false` for various services defined in the catalog, which (IIUC) largely controls catalog-controlled monitoring, like probes
[23:10:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=wdqs2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:11:09] given the state of wdqs2009, would it make sense to add it to the exclusion regex in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/cdn.yaml ?
[23:11:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2009:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:11:44] swfrench-wmf: absolutely
[23:12:36] ryankemper: great, let me open a task for that
[23:16:26] SRE: Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523 (Scott_French) NEW
[23:16:48] (PS1) Ryan Kemper: wdqs: disable ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523)
[23:16:56] swfrench-wmf: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1166016
[23:17:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:18:03] ah, awesome!
[23:18:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:19:52] (CR) Scott French: [C:+1] wdqs: disable ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523) (owner: Ryan Kemper)
[23:20:20] jhathaway@cumin2002 reimage (PID 2100) is awaiting input
[23:21:24] SRE, Data-Platform-SRE, Patch-For-Review: Suppress ATSBackendErrorsHigh for wdqs2009.codfw.wmnet - https://phabricator.wikimedia.org/T398523#10970397 (Scott_French)
[23:25:27] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2001.codfw.wmnet with OS bookworm
[23:27:23] (CR) Ryan Kemper: [C:+2] wdqs: disable ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523) (owner: Ryan Kemper)
[23:28:32] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[23:28:38] (Merged) jenkins-bot: wdqs: disable ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1166016 (https://phabricator.wikimedia.org/T398523) (owner: Ryan Kemper)
[23:34:25] FIRING: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:17] !log removing 15 files for legal compliance
[23:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166023
[23:38:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166023 (owner: TrainBranchBot)
[23:40:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm
[23:40:58] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10970426 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed...
[23:49:46] (PS1) Dzahn: initial commit - add .gitreview file [container/codesearch] - https://gerrit.wikimedia.org/r/1166037 (https://phabricator.wikimedia.org/T268199)
[23:50:04] (CR) Dzahn: [C:+2] initial commit - add .gitreview file [container/codesearch] - https://gerrit.wikimedia.org/r/1166037 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[23:50:27] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166023 (owner: TrainBranchBot)
[23:53:01] (CR) Dzahn: [V:+2 C:+2] initial commit - add .gitreview file [container/codesearch] - https://gerrit.wikimedia.org/r/1166037 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[23:53:23] (PS1) Dzahn: add initial blubber .pipeline config and a README [container/codesearch] - https://gerrit.wikimedia.org/r/1166044 (https://phabricator.wikimedia.org/T268199)
[23:59:25] RESOLVED: SystemdUnitFailed: user@499.service on poolcounter1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed