[00:02:23] <ebernhardson>	 !jouncebot next
[00:02:23] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[00:02:40] <ebernhardson>	 jouncebot: next
[00:02:40] <jouncebot>	 In 5 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600)
[00:02:41] <jouncebot>	 In 5 hour(s) and 57 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600)
[00:03:10] <ebernhardson>	 If there are no complaints, I'm going to undeploy a mitigation for search-traffic in mediawiki-config
[00:03:29] <ebernhardson>	 (there is now a requestctl rule addressing the issue, and followup heuristics in cirrus to be deployed next week)
[00:04:04] <wikibugs>	 (03PS2) 10Ebernhardson: Revert "cirrus: Send more_like traffic to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191168 (https://phabricator.wikimedia.org/T405394)
[00:04:50] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212532 (10phaultfinder)
[00:05:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191168 (https://phabricator.wikimedia.org/T405394) (owner: 10Ebernhardson)
[00:07:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "cirrus: Send more_like traffic to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191168 (https://phabricator.wikimedia.org/T405394) (owner: 10Ebernhardson)
[00:07:41] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1191168|Revert "cirrus: Send more_like traffic to eqiad" (T405394)]]
[00:07:48] <stashbot>	 T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394
[00:08:21] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191197
[00:08:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191197 (owner: 10TrainBranchBot)
[00:11:53] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1220.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:12:05] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1191168|Revert "cirrus: Send more_like traffic to eqiad" (T405394)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:12:26] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1217.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:12:40] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1215.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:12:45] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Continuing with sync
[00:13:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1219.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:13:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1224.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:13:49] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1218.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:14:25] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1223.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:15:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1225.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:15:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1226.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:17:30] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191168|Revert "cirrus: Send more_like traffic to eqiad" (T405394)]] (duration: 09m 48s)
[00:17:36] <stashbot>	 T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394
[00:19:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1221.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:19:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1227.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:21:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1222.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:21:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1228.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:24:43] <wikibugs>	 (03PS1) 10RLazarus: wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366)
[00:25:54] <wikibugs>	 (03PS2) 10RLazarus: wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366)
[00:33:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191197 (owner: 10TrainBranchBot)
[00:34:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212563 (10phaultfinder)
[00:39:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212569 (10phaultfinder)
[00:39:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11212570 (10phaultfinder)
[00:40:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1224.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:41:04] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1223.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:41:53] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1225.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:42:27] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1226.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:42:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1229.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:44:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1230.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:44:50] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1232.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:44:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212571 (10phaultfinder)
[00:45:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1231.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:46:20] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1227.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:48:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1228.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:48:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1216.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:54:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212573 (10phaultfinder)
[01:17:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212619 (10Jclark-ctr)
[01:20:20] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye
[01:20:24] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1210.eqiad.wmnet with OS bullseye
[01:20:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1210.eqiad.wmnet with OS bullseye
[01:20:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212623 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1210.eqiad.wmnet with OS bullseye executed with errors: - an-worker...
[01:29:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye
[01:29:31] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1210.eqiad.wmnet with OS bullseye
[01:29:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212627 (10phaultfinder)
[01:30:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212628 (10phaultfinder)
[01:31:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[01:33:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:36:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212635 (10Jclark-ctr) @BTullis I’ve only set up RAID1 for an-worker1210. I wanted to get one running before the end of the night, but I’m not having any luck. Could you help me wit...
[01:41:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:44:50] <jinxer-wm>	 FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[01:45:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11212636 (10phaultfinder)
[01:52:44] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11212641 (10Krinkle) >>! In T122097#2657531, @BBlack wrote: > This has been idle a while, but it's still probably a good...
[01:54:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212646 (10phaultfinder)
[01:54:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212647 (10phaultfinder)
[01:59:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11212651 (10phaultfinder)
[02:09:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212654 (10phaultfinder)
[02:12:17] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:16:44] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:17:17] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:19:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:19:50] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:19:54] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:24:50] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:24:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212655 (10phaultfinder)
[02:24:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212656 (10phaultfinder)
[02:27:17] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:17] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:49:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11212660 (10phaultfinder)
[02:58:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:30] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:59:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212661 (10phaultfinder)
[03:00:20] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:03:30] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:03:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:06:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:06:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:06:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:20:20] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:34] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212667 (10phaultfinder)
[03:44:50] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:04:50] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212674 (10phaultfinder)
[04:26:28] <wikibugs>	 (03CR) 10Ladsgroup: "Yup. I can take care of it if you focus on getting mw code adopted." [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe)
[04:44:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:44:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212699 (10phaultfinder)
[04:45:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:46:19] <Amir1>	 Sorta here
[04:46:25] <Amir1>	 In airport...
[04:49:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:50:34] <Amir1>	 okay, got the laptop up. Looking...
[04:50:58] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:53:15] <Amir1>	 mw errors have jumped but I can't see any jump in logstash :/ https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&from=now-6h&to=now&timezone=utc&var-site=$__all&var-deployment=mw-web&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki&refresh=1m&viewPanel=panel-63
[04:54:10] <_joe_>	 Amir1: have you seen the numbers?
[04:54:31] <Amir1>	 it's low but it's paging because wdqs updater is falling behind
[04:54:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:55:58] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:56:14] <Amir1>	 !incidents
[04:56:14] <sirenbot>	 6795 (UNACKED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet codfw)
[04:56:14] <sirenbot>	 6787 (RESOLVED)  [2x] ProbeDown sre (dse-k8s-ctrl2002:6443 probes/custom codfw)
[04:56:24] <Amir1>	 calling search platform
[04:58:22] <Amir1>	 !ack 6795
[04:58:22] <sirenbot>	 6795 (ACKED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet codfw)
[04:58:28] <Amir1>	 Acking so it stops paging me
[04:58:33] <Amir1>	 Calling Ryan
[04:59:18] <_joe_>	 Amir1: you shouldn't be the only person being paged. it's 10 pm on the US west coast, I get paged until 11 pm
[04:59:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:00:49] <_joe_>	 Amir1: it's recovering btw
[05:01:07] <_joe_>	 https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-3c-3e-origin-servers-overview?orgId=1&from=now-24h&to=now&timezone=utc&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet&var-origin=wdqs-scholarly.discovery.wmnet&var-origin=wdqs.discovery.wmnet&viewPanel=panel-12
[05:01:45] <_joe_>	 ehhh not really actually
[05:03:16] <Amir1>	 guillaume is waking up and will call someone
[05:03:57] <_joe_>	 All wdqs-main servers in codfw are marked partially up
[05:04:20] <_joe_>	 Fetch failed (https://localhost/readiness-probe)
[05:04:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:05:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11212705 (10SD0001) Got the email, and have reset the temporary password. Thanks!
[05:05:10] <_joe_>	 load on the servers is in the 100s
[05:05:24] <gehel>	 ok, I'm here...
[05:05:43] <Amir1>	 thanks
[05:05:54] <Amir1>	 I think all codfw updaters are broken?
[05:05:58] <jinxer-wm>	 FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:07:13] <gehel>	 so, the updater itself seems ok, but the WDQS servers are overloaded and can't apply updates?
[05:09:09] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212706 (10phaultfinder)
[05:10:29] <gehel>	 we're sending all traffic to codfw, so more load per server than usual. We should be provisioned to handle that, but maybe we're not.
[05:10:58] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:12:57] <gehel>	 I'm trying to get hold of David.
[05:14:49] <gehel>	 WDQS SLOs are low enough that it's not the end of the world if it is down for a few hours.
[05:14:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212707 (10phaultfinder)
[05:15:58] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:17:33] <Amir1>	 I have to board the plane. I don't have access for a couple of hours
[05:17:48] <Amir1>	 the page is acked so it shouldn't wake up anyone else
[05:18:03] <Amir1>	 (unless it doesn't resolve in 24 hours, which I hope not)
[05:18:54] <gehel>	 Amir1: thanks a lot!
[05:19:06] <_joe_>	 I'll take a look at traffic patterns in the meantime
[05:21:22] <wikibugs>	 (03PS1) 10KartikMistry: cxserver: staging: Update to 2025-09-25-051716-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191231 (https://phabricator.wikimedia.org/T394982)
[05:22:08] <kart_>	 Updating cxserver/staging ^
[05:23:19] <_joe_>	 gehel: is there a cookbook/runbook to roll restart blazegraph? this looks like a query-of-death situation tbh
[05:23:30] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update to 2025-09-25-051716-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191231 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry)
[05:23:32] <_joe_>	 traffic wasn't particularly elevated when this happened
[05:24:24] <gehel>	 There should be one
[05:24:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212712 (10phaultfinder)
[05:25:01] <gehel>	 I have to feed the kids and send them to school. I'll be back in 40'
[05:25:11] <wikibugs>	 (03Merged) 10jenkins-bot: cxserver: staging: Update to 2025-09-25-051716-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191231 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry)
[05:25:54] <_joe_>	 Oh I have stuff to do too.... I guess I'll get going.
[05:26:40] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:27:02] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:29:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:30:54] <kart_>	 !log staging: Updated cxserver to 2025-09-25-051716-production (T394982)
[05:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:02] <stashbot>	 T394982: Migrate cxserver in production to node22 - https://phabricator.wikimedia.org/T394982
[05:38:18] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543 (10Superpes15) 03NEW
[05:39:09] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:50] <jinxer-wm>	 FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[05:45:10] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11212737 (phaultfinder)
[05:55:58] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:59:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600).
[06:04:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:04:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11212755 (phaultfinder)
[06:14:09] <jinxer-wm>	 FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[06:15:32] <wikibugs>	 (PS1) Kosta Harlan: CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1191233 (https://phabricator.wikimedia.org/T405342)
[06:16:03] <kostajh>	 Amir1: are you deploying now, or can I sync something? 
[06:16:44] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:19:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:19:50] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:19:54] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:20:03] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:20:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1191233 (https://phabricator.wikimedia.org/T405342) (owner: Kosta Harlan)
[06:21:23] <wikibugs>	 (Merged) jenkins-bot: CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1191233 (https://phabricator.wikimedia.org/T405342) (owner: Kosta Harlan)
[06:21:54] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1191233|CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis (T405342)]]
[06:22:00] <stashbot>	 T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342
[06:24:50] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:24:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:26:23] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1191233|CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis (T405342)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:28:19] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[06:29:58] <wikibugs>	 (PS2) Giuseppe Lavagetto: tls: ban default UAs with forge URLs [puppet] - https://gerrit.wikimedia.org/r/1190004 (https://phabricator.wikimedia.org/T400119)
[06:31:31] <gehel>	 !log restarting blazegraph on wdqs-main@codfw
[06:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:30] <wikibugs>	 (CR) Muehlenhoff: "maps2011-maps2014 are now fully replicated, good to merge" [deployment-charts] - https://gerrit.wikimedia.org/r/1190578 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[06:33:16] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191233|CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis (T405342)]] (duration: 11m 22s)
[06:33:22] <stashbot>	 T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342
[06:33:34] <wikibugs>	 (PS1) TChin: [eventgate-*] Bump to v1.24.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169)
[06:33:59] <wikibugs>	 (CR) Giuseppe Lavagetto: "Because the logic is more complex than what that would imply." [puppet] - https://gerrit.wikimedia.org/r/1190004 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[06:34:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:34:51] <jinxer-wm>	 RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:35:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] tls: ban default UAs with forge URLs [puppet] - https://gerrit.wikimedia.org/r/1190004 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[06:35:58] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:37:17] <jinxer-wm>	 RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:39:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:40:58] <jinxer-wm>	 RESOLVED: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:43:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet
[06:44:14] <moritzm>	 aux-k8s-etcd1003, dse-k8s-etcd1001 and kubestagemaster1005 will go down for a Ganeti reboot
[06:44:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet
[06:45:52] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:46:14] <icinga-wm>	 PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100%
[06:46:32] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:50:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet
[06:50:28] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[06:50:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet
[06:50:54] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[06:50:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:51:16] <icinga-wm>	 RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[06:52:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[06:54:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11212804 (phaultfinder)
[06:55:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:55:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:57:30] <wikibugs>	 (PS21) Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688)
[06:58:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:45] <logmsgbot>	 jmm@cumin2002 drain-node (PID 172439) is awaiting input
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0700).
[07:00:05] <jouncebot>	 James_F, sergi0, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] <James_F>	 Heya.
[07:00:15] <sergi0>	 o/
[07:00:32] <James_F>	 sergi0: Did you want to deploy? You should go first either way. Happy to do it.
[07:00:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[07:01:31] <wikibugs>	 (CR) Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: Slyngshede)
[07:03:24] <sergi0>	 James_F go for it, I will test
[07:03:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:49] <James_F>	 Hmm.
[07:04:10] <James_F>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1190698 depends on a MetricsPlatform patch, so you'll need that cherry-picked too?
[07:05:27] <James_F>	 … which doesn't cherry-pick cleanly, eurgh.
[07:05:35] <sergi0>	 hmm, I should have removed the depends, the MP patch is already in wmf.20 as far as I can see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MetricsPlatform/+/1189522
[07:05:40] <James_F>	 sergi0: Can you create the cherry-picks?
[07:05:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:06:03] <James_F>	 Oh, right, but SpiderPig thinks it isn't.
[07:06:12] <James_F>	 Yeah, I'll just drop the dependency.
[07:06:24] <wikibugs>	 (PS2) Jforrester: ExperimentXLabManager: allow to re-enroll a user in experiments [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: Sergio Gimeno)
[07:06:30] <sergi0>	 Sorry about that, ty!
[07:06:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: Sergio Gimeno)
[07:06:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[07:06:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:43] <James_F>	 No worries at all!
[07:06:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet
[07:07:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[07:09:52] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212819 (phaultfinder)
[07:10:56] <logmsgbot>	 jmm@cumin2002 drain-node (PID 180950) is awaiting input
[07:13:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[07:18:00] <wikibugs>	 (CR) Slyngshede: [C:+2] P:puppetserver::volatile Include XCheeseScore private repo [puppet] - https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: Slyngshede)
[07:18:59] <wikibugs>	 (CR) CI reject: [V:-1] ExperimentXLabManager: allow to re-enroll a user in experiments [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: Sergio Gimeno)
[07:19:10] <James_F>	 Eurgh.
[07:19:48] <James_F>	 sergi0: Do the API tests sometimes fail like this for GrowthExperiments, or is this CI failure likely real?
[07:20:38] <James_F>	 I'll do the simple config patches whilst we work that out.
[07:20:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190702 (https://phabricator.wikimedia.org/T404085) (owner: Sergio Gimeno)
[07:20:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189617 (https://phabricator.wikimedia.org/T404700) (owner: Anzx)
[07:20:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189616 (https://phabricator.wikimedia.org/T404700) (owner: Anzx)
[07:21:09] <sergi0>	 Looking into it, I had not seen that error on GE before
[07:21:21] <James_F>	 I can C+2 it again and see if it passes.
[07:21:49] <wikibugs>	 (Merged) jenkins-bot: Growth [testwiki]: enable new notifications and reduce scheduling time [mediawiki-config] - https://gerrit.wikimedia.org/r/1190702 (https://phabricator.wikimedia.org/T404085) (owner: Sergio Gimeno)
[07:21:50] <wikibugs>	 (CR) Jforrester: [C:+2] "Let's try this again." [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: Sergio Gimeno)
[07:21:51] <wikibugs>	 (Merged) jenkins-bot: mswikiquote: set timezone, sitename and project namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1189617 (https://phabricator.wikimedia.org/T404700) (owner: Anzx)
[07:21:56] <wikibugs>	 (Merged) jenkins-bot: mswikiquote: add logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1189616 (https://phabricator.wikimedia.org/T404700) (owner: Anzx)
[07:22:02] <James_F>	 OK, first batch going ahead now.
[07:22:16] <wikibugs>	 (PS1) Slyngshede: Revert "P:puppetserver::volatile Include XCheeseScore private repo" [puppet] - https://gerrit.wikimedia.org/r/1191239
[07:22:25] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1190702|Growth [testwiki]: enable new notifications and reduce scheduling time (T404085)]], [[gerrit:1189617|mswikiquote: set timezone, sitename and project namespace (T404700)]], [[gerrit:1189616|mswikiquote: add logo (T404700)]]
[07:22:34] <stashbot>	 T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085
[07:22:34] <stashbot>	 T404700: Post-creation work for mswikiquote - https://phabricator.wikimedia.org/T404700
[07:22:37] <James_F>	 sergi0: Please be ready to test the notifications on testwiki in a minute or two.
[07:23:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet
[07:23:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet
[07:24:03] <wikibugs>	 (Merged) jenkins-bot: ExperimentXLabManager: allow to re-enroll a user in experiments [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: Sergio Gimeno)
[07:24:03] <sergi0>	 James_F: ack
[07:24:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[07:24:24] <James_F>	 Aha, cool, next time I sync it'll pick that up.
[07:26:01] <anzx>	 James_F: mswikiquote logo and other changes look good to sync
[07:26:15] <James_F>	 anzx: Cool, thank you.
[07:28:00] <wikibugs>	 (PS1) Slyngshede: P::puppetserver::volatile fix xcheesescore repo path [puppet] - https://gerrit.wikimedia.org/r/1191240 (https://phabricator.wikimedia.org/T404688)
[07:28:08] <logmsgbot>	 jmm@cumin2002 drain-node (PID 188053) is awaiting input
[07:28:41] <wikibugs>	 (PS4) D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808)
[07:28:47] <logmsgbot>	 !log jforrester@deploy1003 anzx, jforrester, sgimeno: Backport for [[gerrit:1190702|Growth [testwiki]: enable new notifications and reduce scheduling time (T404085)]], [[gerrit:1189617|mswikiquote: set timezone, sitename and project namespace (T404700)]], [[gerrit:1189616|mswikiquote: add logo (T404700)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:28:56] <stashbot>	 T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085
[07:28:57] <stashbot>	 T404700: Post-creation work for mswikiquote - https://phabricator.wikimedia.org/T404700
[07:29:20] <James_F>	 sergi0: Please check.
[07:29:25] <sergi0>	 on it
[07:29:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[07:30:00] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212862 (phaultfinder)
[07:30:15] <wikibugs>	 (CR) Slyngshede: [C:+2] P::puppetserver::volatile fix xcheesescore repo path [puppet] - https://gerrit.wikimedia.org/r/1191240 (https://phabricator.wikimedia.org/T404688) (owner: Slyngshede)
[07:30:56] <wikibugs>	 (PS1) Muehlenhoff: Add maps1012 to maps1014 as replicas [puppet] - https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565)
[07:31:12] <sergi0>	 James_F: lgtm
[07:31:15] <James_F>	 Ack.
[07:31:17] <logmsgbot>	 !log jforrester@deploy1003 anzx, jforrester, sgimeno: Continuing with sync
[07:33:00] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:34:32] <anzx>	 James_F: please run namespacedupes for mswikiquote after completion of sync
[07:34:41] <James_F>	 anzx: Ack.
[07:35:49] <wikibugs>	 (PS1) Slyngshede: P:puppetserver::volatile  repo -> repos [puppet] - https://gerrit.wikimedia.org/r/1191243
[07:35:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[07:36:01] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190702|Growth [testwiki]: enable new notifications and reduce scheduling time (T404085)]], [[gerrit:1189617|mswikiquote: set timezone, sitename and project namespace (T404700)]], [[gerrit:1189616|mswikiquote: add logo (T404700)]] (duration: 13m 36s)
[07:36:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[07:36:10] <stashbot>	 T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085
[07:36:11] <stashbot>	 T404700: Post-creation work for mswikiquote - https://phabricator.wikimedia.org/T404700
[07:36:49] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1190698|ExperimentXLabManager: allow to re-enroll a user in experiments (T401308)]]
[07:36:55] <logmsgbot>	 !log jforrester@deploy1003 mwscript-k8s job started: namespaceDupes mswikiquote --fix  # T404700
[07:36:56] <stashbot>	 T401308: Create A/B test experiment for leveling up notifications - https://phabricator.wikimedia.org/T401308
[07:37:30] <anzx>	 James_F: thanks for deploying 
[07:37:35] <James_F>	 anzx: Would the list of moved pages be useful? None look surprising.
[07:38:04] <anzx>	 not required, all pages look correctly moved
[07:38:12] <James_F>	 Excellent
[07:38:21] <sergi0>	 Thank you James!
[07:38:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[07:38:47] <James_F>	 Deploying to debug servers now.
[07:40:07] <wikibugs>	 (CR) Slyngshede: [C:+2] P:puppetserver::volatile  repo -> repos [puppet] - https://gerrit.wikimedia.org/r/1191243 (owner: Slyngshede)
[07:42:42] <logmsgbot>	 !log jforrester@deploy1003 jforrester, sgimeno: Backport for [[gerrit:1190698|ExperimentXLabManager: allow to re-enroll a user in experiments (T401308)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:42:50] <stashbot>	 T401308: Create A/B test experiment for leveling up notifications - https://phabricator.wikimedia.org/T401308
[07:43:35] <James_F>	 sergi0: Please check.
[07:43:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[07:44:50] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:44:51] <wikibugs>	 (PS1) Slyngshede: P:puppetserver::volatile xcheesescore main branch [puppet] - https://gerrit.wikimedia.org/r/1191246
[07:45:01] <sergi0>	 on it
[07:46:01] <James_F>	 Thanks.
[07:46:56] <wikibugs>	 (PS1) Ryan Kemper: Simplify make_api_call function [software/spicerack] - https://gerrit.wikimedia.org/r/1191247
[07:46:56] <wikibugs>	 (PS1) Ryan Kemper: Flush markers propagates APIClientError [software/spicerack] - https://gerrit.wikimedia.org/r/1191248
[07:47:15] <wikibugs>	 (CR) Slyngshede: [C:+2] P:puppetserver::volatile xcheesescore main branch [puppet] - https://gerrit.wikimedia.org/r/1191246 (owner: Slyngshede)
[07:49:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[07:49:48] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:49:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[07:49:58] <sergi0>	 James_F: good from my side
[07:49:59] <wikibugs>	 (PS1) KartikMistry: cxserver: staging: Update to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191249 (https://phabricator.wikimedia.org/T394982)
[07:50:03] <logmsgbot>	 !log jforrester@deploy1003 jforrester, sgimeno: Continuing with sync
[07:50:05] <James_F>	 Cool.
[07:50:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet
[07:50:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[07:50:44] <James_F>	 Let's see how swiftly we can get the Graph removal landed.
[07:51:10] <wikibugs>	 (PS3) Jforrester: Stop loading the Graph extension anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317)
[07:51:41] <kart_>	 Minor cxserver deployment..
[07:52:00] <wikibugs>	 (CR) KartikMistry: [C:+2] cxserver: staging: Update to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191249 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[07:52:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[07:52:19] <wikibugs>	 (CR) Jforrester: [C:+2] Stop loading the Graph extension anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[07:52:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[07:52:37] <Reedy>	 yay
[07:53:20] <wikibugs>	 (Merged) jenkins-bot: Stop loading the Graph extension anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[07:53:44] <wikibugs>	 (Merged) jenkins-bot: cxserver: staging: Update to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191249 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[07:54:52] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190698|ExperimentXLabManager: allow to re-enroll a user in experiments (T401308)]] (duration: 18m 03s)
[07:54:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212915 (phaultfinder)
[07:55:00] <stashbot>	 T401308: Create A/B test experiment for leveling up notifications - https://phabricator.wikimedia.org/T401308
[07:55:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[07:55:27] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1184797|Stop loading the Graph extension anywhere (T362317)]]
[07:55:32] <stashbot>	 T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317
[07:55:45] <wikibugs>	 SRE-OnFire, cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11212920 (fgiunchedi) a:dcaro→fgiunchedi
[07:55:48] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[07:56:07] <wikibugs>	 (CR) CI reject: [V:-1] Flush markers propagates APIClientError [software/spicerack] - https://gerrit.wikimedia.org/r/1191248 (owner: Ryan Kemper)
[07:56:16] <wikibugs>	 (CR) CI reject: [V:-1] Simplify make_api_call function [software/spicerack] - https://gerrit.wikimedia.org/r/1191247 (owner: Ryan Kemper)
[07:56:28] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:56:52] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:56:57] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[07:58:02] <kart_>	 !log staging: Updated cxserver to 2025-09-25-074241-production (T394982)
[07:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:07] <stashbot>	 T394982: Migrate cxserver in production to node22 - https://phabricator.wikimedia.org/T394982
[07:58:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:59:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212940 (phaultfinder)
[08:01:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[08:01:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet
[08:01:38] <wikibugs>	 (PS1) Slyngshede: D:git::clone add environment to pull command [puppet] - https://gerrit.wikimedia.org/r/1191291 (https://phabricator.wikimedia.org/T404688)
[08:01:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[08:04:35] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/srv 3.632% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:04:50] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:42] <wikibugs>	 (CR) Brouberol: [C:+1] idp: Add dummy data for airflow-wikidata [labs/private] - https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073) (owner: Stevemunene)
[08:07:00] <logmsgbot>	 jmm@cumin2002 drain-node (PID 217655) is awaiting input
[08:07:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:11:55] <wikibugs>	 (CR) Fabfur: [C:+1] D:git::clone add environment to pull command [puppet] - https://gerrit.wikimedia.org/r/1191291 (https://phabricator.wikimedia.org/T404688) (owner: Slyngshede)
[08:12:46] <wikibugs>	 (CR) Slyngshede: [C:+2] D:git::clone add environment to pull command [puppet] - https://gerrit.wikimedia.org/r/1191291 (https://phabricator.wikimedia.org/T404688) (owner: Slyngshede)
[08:14:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[08:14:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Recommendation-API: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11212984 (Nikerabbit)
[08:14:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212985 (phaultfinder)
[08:17:59] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Patch LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: Andrew Bogott)
[08:18:10] <wikibugs>	 (CR) Reedy: [C:+1] OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[08:20:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[08:20:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[08:21:26] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1184797|Stop loading the Graph extension anywhere (T362317)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:21:31] <stashbot>	 T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317
[08:22:02] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[08:22:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[08:23:08] <wikibugs>	 (PS1) Tiziano Fogli: loki: increase ulimit nofile [puppet] - https://gerrit.wikimedia.org/r/1191300 (https://phabricator.wikimedia.org/T405552)
[08:24:54] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1030.eqiad.wmnet
[08:24:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[08:25:52] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1030.eqiad.wmnet
[08:26:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet
[08:26:33] <wikibugs>	 (CR) Stevemunene: [V:+2 C:+2] idp: Add dummy data for airflow-wikidata [labs/private] - https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073) (owner: Stevemunene)
[08:30:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet
[08:34:41] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184797|Stop loading the Graph extension anywhere (T362317)]] (duration: 39m 14s)
[08:34:48] <stashbot>	 T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317
[08:34:52] <James_F>	 Finally.
[08:36:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet
[08:36:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet
[08:38:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:39:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[08:42:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[08:43:58] <wikibugs>	 (PS1) Ryan Kemper: Remove test_flush_markers_on_clusters_fail_synced [software/spicerack] - https://gerrit.wikimedia.org/r/1191303
[08:43:58] <wikibugs>	 (PS1) Ryan Kemper: Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - https://gerrit.wikimedia.org/r/1191304
[08:44:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:48:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[08:49:04] <wikibugs>	 (PS2) Ryan Kemper: Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - https://gerrit.wikimedia.org/r/1191304
[08:49:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[08:49:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[08:49:56] <wikibugs>	 (CR) Elukey: [C:+1] wikifeeds: Remove envoy image_version override [deployment-charts] - https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) (owner: RLazarus)
[08:50:02] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213132 (phaultfinder)
[08:51:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet
[08:52:24] <wikibugs>	 (CR) CI reject: [V:-1] Remove test_flush_markers_on_clusters_fail_synced [software/spicerack] - https://gerrit.wikimedia.org/r/1191303 (owner: Ryan Kemper)
[08:53:34] <wikibugs>	 (CR) CI reject: [V:-1] Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - https://gerrit.wikimedia.org/r/1191304 (owner: Ryan Kemper)
[08:57:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet
[08:57:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet
[08:58:02] <wikibugs>	 (CR) Elukey: thanos: Add recording rules for xlab SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez)
[08:58:34] <wikibugs>	 (CR) CI reject: [V:-1] Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - https://gerrit.wikimedia.org/r/1191304 (owner: Ryan Kemper)
[08:58:50] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] "self-merge to fix the service" [puppet] - https://gerrit.wikimedia.org/r/1191300 (https://phabricator.wikimedia.org/T405552) (owner: Tiziano Fogli)
[08:59:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet
[08:59:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:19] <logmsgbot>	 jmm@cumin2002 drain-node (PID 246015) is awaiting input
[09:04:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:06:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet
[09:06:46] <wikibugs>	 (PS1) Btullis: Remove the existing spark-operator release [deployment-charts] - https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490)
[09:06:50] <wikibugs>	 (PS1) Btullis: Remove our custom spark-operator helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490)
[09:06:54] <wikibugs>	 (PS1) Btullis: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490)
[09:06:58] <wikibugs>	 (PS2) Btullis: Import the upstream spark-operator chart version 2.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490)
[09:07:05] <wikibugs>	 (PS2) Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490)
[09:07:11] <wikibugs>	 (PS2) Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490)
[09:11:07] <wikibugs>	 SRE, Patch-For-Review, Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11213197 (jijiki) My conce...
[09:12:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet
[09:12:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet
[09:12:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:13:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet
[09:13:16] <wikibugs>	 (CR) Elukey: thanos: Add recording rules for xlab SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez)
[09:14:02] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543#11213203 (MatthewVernon) Open→Resolved a:MatthewVernon Done, and they look correct to me: ` root@ms-fe2009:~# for i in...
[09:17:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet
[09:17:17] <wikibugs>	 (CR) Cathal Mooney: "Ha thanks yep that's all I need.  Thanks!" [homer/public] - https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:18:07] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'.
[09:18:33] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Nokia: ESI-LAG configuration [homer/public] - https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:18:38] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[09:19:55] <wikibugs>	 (Merged) jenkins-bot: Nokia: ESI-LAG configuration [homer/public] - https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:19:58] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213226 (phaultfinder)
[09:20:04] <wikibugs>	 (PS15) Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470)
[09:20:04] <wikibugs>	 (CR) Arnaudb: "the goal of this CR is:" [puppet] - https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb)
[09:20:50] <wikibugs>	 (PS16) Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470)
[09:22:36] <wikibugs>	 (CR) Elukey: [C:+2] services: move kartotherian and tegola to the new codfw stack [deployment-charts] - https://gerrit.wikimedia.org/r/1190578 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[09:23:03] <wikibugs>	 (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb)
[09:23:13] <Dreamy_Jazz>	 jouncebot: next
[09:23:13] <jouncebot>	 In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1000)
[09:23:29] <wikibugs>	 (PS2) Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490)
[09:24:30] <wikibugs>	 (PS1) Ryan Kemper: WIP: rewriting test_force_allocation_of_all_unassigned_shards [software/spicerack] - https://gerrit.wikimedia.org/r/1191310
[09:24:44] <wikibugs>	 (PS1) Mvolz: Update zotero to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1191311
[09:25:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet
[09:25:09] <wikibugs>	 (CR) Btullis: "The CI is failing with the following error:" [deployment-charts] - https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: Btullis)
[09:25:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet
[09:27:16] <wikibugs>	 (PS1) Dreamy Jazz: CheckUser: Enable SI special page on enwiki and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556)
[09:27:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: Dreamy Jazz)
[09:29:19] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[09:31:21] <wikibugs>	 (PS1) Elukey: services: move tegola's codfw postgres config to the new stack [deployment-charts] - https://gerrit.wikimedia.org/r/1191314 (https://phabricator.wikimedia.org/T381565)
[09:31:49] <wikibugs>	 (PS2) Btullis: Remove the existing spark-operator release [deployment-charts] - https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490)
[09:31:50] <wikibugs>	 (PS2) Btullis: Remove our custom spark-operator helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490)
[09:31:50] <wikibugs>	 (PS2) Btullis: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490)
[09:31:50] <wikibugs>	 (PS3) Btullis: Import the upstream spark-operator chart version 2.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490)
[09:31:51] <wikibugs>	 (PS3) Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490)
[09:31:55] <wikibugs>	 (PS3) Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490)
[09:31:59] <wikibugs>	 (PS3) Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490)
[09:33:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558 (cmooney) NEW p:Triage→Medium
[09:33:10] <wikibugs>	 (CR) CI reject: [V:-1] WIP: rewriting test_force_allocation_of_all_unassigned_shards [software/spicerack] - https://gerrit.wikimedia.org/r/1191310 (owner: Ryan Kemper)
[09:33:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11213294 (cmooney)
[09:33:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11213293 (cmooney)
[09:34:40] <wikibugs>	 SRE-OnFire, cloud-services-team, Toolforge, Patch-For-Review, Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11213300 (taavi) Open→Resolved
[09:38:11] <wikibugs>	 (PS2) Brouberol: Flush markers propagates APIClientError [software/spicerack] - https://gerrit.wikimedia.org/r/1191248 (owner: Ryan Kemper)
[09:38:13] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[09:38:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[09:39:02] <wikibugs>	 (CR) Elukey: [C:+2] services: move tegola's codfw postgres config to the new stack [deployment-charts] - https://gerrit.wikimedia.org/r/1191314 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[09:39:24] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[09:39:44] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[09:40:04] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[09:40:50] <wikibugs>	 (CR) CI reject: [V:-1] Add a helmfile release for the updated spark-operator version [deployment-charts] - https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: Btullis)
[09:41:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560 (cmooney) NEW
[09:41:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11213334 (cmooney)
[09:41:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11213333 (cmooney)
[09:42:14] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[09:44:37] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562 (cmooney) NEW p:Triage→Medium
[09:44:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11213369 (cmooney)
[09:44:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11213370 (cmooney)
[09:47:49] <wikibugs>	 (CR) CI reject: [V:-1] Flush markers propagates APIClientError [software/spicerack] - https://gerrit.wikimedia.org/r/1191248 (owner: Ryan Kemper)
[09:52:18] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[09:52:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11213413 (cmooney)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1000)
[10:01:20] <wikibugs>	 (PS1) Slyngshede: P:idp re-add NDA group for Netbox OIDC [puppet] - https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494)
[10:01:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:45] <wikibugs>	 (PS1) Pmiazga: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405544)
[10:02:56] <wikibugs>	 (CR) CI reject: [V:-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405544) (owner: Pmiazga)
[10:09:54] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213498 (phaultfinder)
[10:09:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11213499 (phaultfinder)
[10:14:50] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[10:16:44] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:19:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:19:50] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:19:58] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:24:50] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:27:25] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11213539 (elukey) >>! In T394357#11193379, @Jhancock.wm wrote: > anything i can try onsite to help?  @Jhancock.wm not sure, I tried to upgrade the BIOS/BMC firmware + BMC reset...
[10:28:08] * elukey lunch!
[10:28:14] <elukey>	 wrong chan :D
[10:32:26] <wikibugs>	 (PS2) Slyngshede: P:idp add ops group for Netbox OIDC [puppet] - https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494)
[10:33:24] <wikibugs>	 (PS3) Slyngshede: P:idp add ops group for Netbox OIDC [puppet] - https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494)
[10:34:48] <wikibugs>	 (CR) Stevemunene: [C:+2] idp: Register airflow-wikidata IDP services [puppet] - https://gerrit.wikimedia.org/r/1190979 (https://phabricator.wikimedia.org/T404073) (owner: Stevemunene)
[10:40:31] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): statistics::wmde: Remove unused graphite_host [puppet] - https://gerrit.wikimedia.org/r/1191322
[10:41:00] <wikibugs>	 (CR) CI reject: [V:-1] statistics::wmde: Remove unused graphite_host [puppet] - https://gerrit.wikimedia.org/r/1191322 (owner: Lucas Werkmeister (WMDE))
[10:41:23] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Disclaimer: I know very little puppet and don’t know if this change is correct, please review with caution! All I know is that we don’t ne" [puppet] - https://gerrit.wikimedia.org/r/1191322 (owner: Lucas Werkmeister (WMDE))
[10:42:18] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): statistics::wmde: Remove unused graphite_host [puppet] - https://gerrit.wikimedia.org/r/1191322
[10:43:13] <wikibugs>	 (PS4) Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490)
[10:43:13] <wikibugs>	 (PS4) Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490)
[10:43:13] <wikibugs>	 (PS4) Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490)
[10:44:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213592 (phaultfinder)
[10:45:29] <wikibugs>	 (PS1) Filippo Giunchedi: interface: new define for additional IPs [puppet] - https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681)
[10:45:31] <wikibugs>	 (PS1) Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681)
[10:46:03] <wikibugs>	 (CR) CI reject: [V:-1] interface: new define for additional IPs [puppet] - https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: Filippo Giunchedi)
[10:46:09] <wikibugs>	 (CR) CI reject: [V:-1] wmcs: have additional IPs survive reboots [puppet] - https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: Filippo Giunchedi)
[10:50:11] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213609 (phaultfinder)
[10:53:07] <wikibugs>	 (PS2) Filippo Giunchedi: interface: new define for additional IPs [puppet] - https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681)
[10:53:07] <wikibugs>	 (PS2) Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681)
[10:53:46] <wikibugs>	 (CR) CI reject: [V:-1] wmcs: have additional IPs survive reboots [puppet] - https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: Filippo Giunchedi)
[10:55:05] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368)
[10:56:21] <wikibugs>	 (PS3) Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681)
[10:57:53] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543#11213639 (Superpes15)
[10:58:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11213650 (10phaultfinder)
[10:59:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213649 (10phaultfinder)
[11:00:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede)
[11:01:02] <wikibugs>	 (03PS1) 10Sergio Gimeno: fix: provide a eventType fallback for already scheduled jobs [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191334 (https://phabricator.wikimedia.org/T405514)
[11:01:23] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543#11213658 (10MatthewVernon)
[11:07:11] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp add ops group for Netbox OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede)
[11:09:06] <wikibugs>	 (03PS1) 10WMDE-Fisch: Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363)
[11:09:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213686 (10phaultfinder)
[11:12:16] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Add airflow-wikidata namespace in admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190974 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene)
[11:14:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11213721 (10phaultfinder)
[11:15:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11213722 (10phaultfinder)
[11:18:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet
[11:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: Add airflow-wikidata namespace in admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190974 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene)
[11:21:10] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch)
[11:21:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet
[11:21:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye
[11:21:26] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1210.eqiad.wmnet with OS bullseye
[11:22:58] <wikibugs>	 (03PS1) 10Hnowlan: (api|rest)-gateway: Add option to disable CSP, disable for rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368)
[11:24:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11213772 (10phaultfinder)
[11:25:29] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert)
[11:25:42] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:26:44] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:29:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet
[11:29:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet
[11:37:05] <wikibugs>	 (03CR) 10A smart kitten: "(Just FYI, this apparently shouldn't've been merged as far before https://gerrit.wikimedia.org/r/1190347 as it was; see T405313#11210087)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15)
[11:38:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:40:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet
[11:41:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:14] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw for all core sections
[11:42:30] <wikibugs>	 (03PS2) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga)
[11:42:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga)
[11:42:55] <wikibugs>	 (03PS3) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga)
[11:43:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga)
[11:44:50] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:45:14] <wikibugs>	 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11213867 (10kostajh) >>! In...
[11:45:49] <logmsgbot>	 jmm@cumin2002 drain-node (PID 323692) is awaiting input
[11:47:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[11:47:35] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw for all core sections
[11:48:00] <wikibugs>	 (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[11:51:02] <wikibugs>	 (03PS19) 10Brouberol: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper)
[11:52:01] <wikibugs>	 (03Abandoned) 10Brouberol: Fix linting errors [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189764 (owner: 10Brouberol)
[11:52:05] <wikibugs>	 (03Abandoned) 10Brouberol: Fix test_flush_markers_on_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189765 (owner: 10Brouberol)
[11:52:10] <wikibugs>	 (03Abandoned) 10Brouberol: Pass the timeout to the underlying http client [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189766 (owner: 10Brouberol)
[11:52:15] <wikibugs>	 (03Abandoned) 10Brouberol: Simplify make_api_call function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191247 (owner: 10Ryan Kemper)
[11:52:20] <wikibugs>	 (03Abandoned) 10Brouberol: Flush markers propagates APIClientError [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191248 (owner: 10Ryan Kemper)
[11:52:24] <wikibugs>	 (03Abandoned) 10Brouberol: Remove test_flush_markers_on_clusters_fail_synced [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191303 (owner: 10Ryan Kemper)
[11:52:28] <wikibugs>	 (03Abandoned) 10Brouberol: Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191304 (owner: 10Ryan Kemper)
[11:52:32] <wikibugs>	 (03Abandoned) 10Brouberol: WIP: rewriting test_force_allocation_of_all_unassigned_shards [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191310 (owner: 10Ryan Kemper)
[11:54:04] <wikibugs>	 (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[11:55:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213912 (10phaultfinder)
[11:55:31] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): CheckUser: Enable SI special page on enwiki and frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz)
[11:57:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet
[11:57:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet
[11:57:13] <wikibugs>	 (03PS2) 10Dreamy Jazz: CheckUser: Enable SI special page on enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556)
[11:58:06] <wikibugs>	 (03PS1) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073)
[11:58:27] <wikibugs>	 (03PS1) 10D3r1ck01: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350
[11:58:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene)
[11:58:54] <wikibugs>	 (03PS1) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195)
[11:59:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1200)
[12:04:24] <wikibugs>	 (03PS2) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195)
[12:04:50] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/srv 3.532% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:04:50] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:40] <wikibugs>	 (03PS3) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195)
[12:06:24] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191069 (owner: 10PipelineBot)
[12:06:26] <wikibugs>	 (03PS20) 10Brouberol: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper)
[12:08:24] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191069 (owner: 10PipelineBot)
[12:09:25] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:09:45] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:09:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 (10cmooney) 03NEW p:05Triage→03Medium
[12:09:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214001 (10phaultfinder)
[12:10:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214002 (10cmooney)
[12:10:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11214003 (10cmooney)
[12:10:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet
[12:10:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214006 (10cmooney)
[12:13:46] <wikibugs>	 (03PS2) 10JMeybohm: haproxy ipblocks-all: Filter disabled ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/1190274 (https://phabricator.wikimedia.org/T402014)
[12:14:05] <wikibugs>	 (03PS1) 10D3r1ck01: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359
[12:14:23] <wikibugs>	 (03CR) 10JMeybohm: "I did mess up the range loop in the last patchset. Corrected that as well." [puppet] - 10https://gerrit.wikimedia.org/r/1190274 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm)
[12:14:27] <wikibugs>	 (03PS1) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195)
[12:14:35] <wikibugs>	 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11214010 (10Reedy) >>! In T4...
[12:14:52] <wikibugs>	 (03PS2) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195)
[12:15:05] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:15:19] <wikibugs>	 (03PS1) 10D3r1ck01: hCaptcha: Fix mock for StatsFactory [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361
[12:15:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[12:15:32] <wikibugs>	 (03PS1) 10D3r1ck01: NewcomerTasks: Use StatsFactory unit test helper [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362
[12:15:32] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:15:42] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:16:01] <wikibugs>	 (03PS1) 10Slyngshede: P:openldap::management add netbox-readonly-access to offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494)
[12:16:10] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2025-09-25-074241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191364 (https://phabricator.wikimedia.org/T394982)
[12:16:12] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:16:17] <logmsgbot>	 jmm@cumin2002 drain-node (PID 338228) is awaiting input
[12:19:10] <wikibugs>	 (03CR) 10D3r1ck01: "recheck" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[12:19:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214030 (10cmooney)
[12:19:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580 (10elukey) 03NEW p:05Triage→03Unbreak!
[12:23:10] <Dreamy_Jazz>	 jouncebot: nowandnext
[12:23:10] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1200)
[12:23:10] <jouncebot>	 In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1300)
[12:25:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 (owner: 10D3r1ck01)
[12:25:18] <wikibugs>	 (03PS1) 10DDesouza: Pre-deploy design research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577)
[12:25:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[12:25:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362 (owner: 10D3r1ck01)
[12:26:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 (owner: 10D3r1ck01)
[12:26:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza)
[12:26:41] <Dreamy_Jazz>	 I'm going to deploy now
[12:26:47] <Dreamy_Jazz>	 If anyone else isn't already
[12:26:58] <wikibugs>	 (03PS2) 10DDesouza: Pre-deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577)
[12:27:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 (owner: 10D3r1ck01)
[12:27:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[12:28:28] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191374
[12:28:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214065 (10elukey) This may be the root cause:  ` 2025-09-24T20:05:13.401802+00:00 puppetserver1001 sudo : TTY=pts/6 ; PWD=/home/denisse ; USER=root ; COMMAND=/usr/bin/puppet ss...
[12:29:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11214071 (10phaultfinder)
[12:31:33] <wikibugs>	 (03PS1) 10Sbisson: Special:Contribute: configure new page target title for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063)
[12:33:53] <wikibugs>	 (03PS1) 10D3r1ck01: Enable multibackend session store on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808)
[12:34:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11214082 (10cmooney) @Jclark-ctr @VRiley-WMF I may have missed to check we have the cables needed for these already.  We're re-using existing...
[12:34:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch)
[12:34:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214084 (10cmooney) a:03cmooney
[12:34:44] <wikibugs>	 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11214083 (10kostajh) >>! In...
[12:34:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214086 (10phaultfinder)
[12:35:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[12:36:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:37:30] <wikibugs>	 (03PS2) 10D3r1ck01: Enable multibackend session store on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808)
[12:38:11] <logmsgbot>	 !log elukey@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for puppetserver1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1003
[12:38:35] <wikibugs>	 (03PS5) 10D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808)
[12:38:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:43:38] <Dreamy_Jazz>	 I'm actively deploying changes to private code and then will deploy the public config patch
[12:44:49] <wikibugs>	 (03PS1) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580)
[12:45:29] <moritzm>	 kubestagemaster1003 will go down for a Ganeti node reboot
[12:45:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet
[12:46:25] <wikibugs>	 (03CR) 10Muehlenhoff: "Typo inline, otherwise LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey)
[12:47:15] <wikibugs>	 (03PS2) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580)
[12:47:29] <wikibugs>	 (03CR) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey)
[12:47:36] <icinga-wm>	 PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:49:43] <jynus>	 !log swap read only for db1176/db2230 (test-s4) T403966
[12:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:49] <stashbot>	 T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[12:49:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214114 (10phaultfinder)
[12:50:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey)
[12:51:04] <icinga-wm>	 RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[12:51:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:53:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet
[12:53:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet
[12:54:20] <logmsgbot>	 !log elukey@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for puppetserver1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1003
[12:55:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet
[12:56:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:57:12] <wikibugs>	 (03CR) 10Muehlenhoff: "Please also update modules/admin/data/nda_groups.txt (eventually I'll fix the various scripts to read the list from there, but for now we n" [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede)
[12:58:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz)
[12:59:37] <wikibugs>	 (03Merged) 10jenkins-bot: CheckUser: Enable SI special page on enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz)
[12:59:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214162 (10phaultfinder)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1300).
[13:00:05] <jouncebot>	 Dreamy_Jazz, tgr, danisztls, and WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1191312|CheckUser: Enable SI special page on enwiki and frwiki (T405556)]]
[13:00:11] <stashbot>	 T405556: Suggested investigations: Enable special page on English and French Wikipedia - https://phabricator.wikimedia.org/T405556
[13:00:14] <Lucas_WMDE>	 o/
[13:00:17] <WMDE-Fisch>	 o/
[13:01:02] <Dreamy_Jazz>	 \o
[13:01:07] <Dreamy_Jazz>	 I am self deploying my backports
[13:01:28] <tgr_>	 o/
[13:01:29] <logmsgbot>	 jmm@cumin2002 drain-node (PID 359050) is awaiting input
[13:01:30] <Dreamy_Jazz>	 Anyone that needs to merge a backport that will take a while in CI (like the core changes) should be safe to start now
[13:01:36] <Dreamy_Jazz>	 As this config change shouldn't be reverted
[13:01:46] <Dreamy_Jazz>	 *reverted at the test stage
[13:01:52] <wikibugs>	 (03PS1) 10Fabfur: haproxy:cache: discard requests w/o Host header [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456)
[13:02:06] <Dreamy_Jazz>	 I will be done after the one that I am currently deploying
[13:02:22] <tgr_>	 thx, will do that then
[13:02:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[13:02:47] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 (owner: 10D3r1ck01)
[13:02:49] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[13:02:50] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] NewcomerTasks: Use StatsFactory unit test helper [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362 (owner: 10D3r1ck01)
[13:02:53] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] hCaptcha: Fix mock for StatsFactory [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 (owner: 10D3r1ck01)
[13:02:59] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 (owner: 10D3r1ck01)
[13:03:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214172 (10elukey) Tried to run the cookbook to renew the cert with a slight modification to skip the initial destroy. It failed when waiting for the new C...
[13:03:09] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[13:03:13] <wikibugs>	 (03PS1) 10Santiago Faci: xLab: Deploying v1.0.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191393 (https://phabricator.wikimedia.org/T385180)
[13:03:53] <wikibugs>	 (03CR) 10Majavah: [C:03+1] "This does fix the immediate issue and seems like a reasonable stop-gap unless/until we get around to converting everything everywhere to n" [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi)
[13:04:06] <WMDE-Fisch>	 tgr_: Feel free to +2 mine too ;-)
[13:04:14] <wikibugs>	 (03CR) 10Majavah: [C:04-1] wmcs: have additional IPs survive reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi)
[13:04:38] <wikibugs>	 (03PS4) 10Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681)
[13:04:51] <danisztls>	 o/ I can self-deploy
[13:07:41] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1191312|CheckUser: Enable SI special page on enwiki and frwiki (T405556)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:07:47] <stashbot>	 T405556: Suggested investigations: Enable special page on English and French Wikipedia - https://phabricator.wikimedia.org/T405556
[13:08:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[13:08:06] <wikibugs>	 (03Merged) 10jenkins-bot: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 (owner: 10D3r1ck01)
[13:08:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet
[13:08:31] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11214225 (10Tgr) >>! In T122097#11212641, @Krinkle wrote: > setting/changing a cookie is equivalent to discarding the br...
[13:09:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.352s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:09:20] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[13:09:23] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch)
[13:10:47] <wikibugs>	 (03PS2) 10Fabfur: haproxy:cache: discard http1.0 requests [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456)
[13:14:09] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191312|CheckUser: Enable SI special page on enwiki and frwiki (T405556)]] (duration: 14m 04s)
[13:14:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.352s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:14:16] <stashbot>	 T405556: Suggested investigations: Enable special page on English and French Wikipedia - https://phabricator.wikimedia.org/T405556
[13:14:19] <Dreamy_Jazz>	 Handing off to the next person
[13:14:23] <Dreamy_Jazz>	 tgr_:
[13:14:36] <wikibugs>	 (03Merged) 10jenkins-bot: NewcomerTasks: Use StatsFactory unit test helper [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362 (owner: 10D3r1ck01)
[13:15:25] <tgr_>	 thx
[13:15:44] <tgr_>	 WMDE-Fisch: I'm deploying your backport as well then
[13:15:54] <WMDE-Fisch>	 tgr_: thanks yes
[13:16:36] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Fix mock for StatsFactory [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 (owner: 10D3r1ck01)
[13:16:39] <wikibugs>	 (03Merged) 10jenkins-bot: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[13:16:45] <wikibugs>	 (03Merged) 10jenkins-bot: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 (owner: 10D3r1ck01)
[13:16:50] <wikibugs>	 (03Merged) 10jenkins-bot: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01)
[13:17:10] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur)
[13:17:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet
[13:19:15] <wikibugs>	 (03PS3) 10Fabfur: haproxy:cache: discard http1.0 requests [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456)
[13:19:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11214262 (10phaultfinder)
[13:19:58] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur)
[13:19:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214261 (10phaultfinder)
[13:20:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet
[13:24:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214287 (10phaultfinder)
[13:25:21] <wikibugs>	 (03Merged) 10jenkins-bot: Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch)
[13:26:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet
[13:26:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet
[13:27:03] <logmsgbot>	 !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1191359|objectcache: Add a hit/miss flag to CachedBagOStuff]], [[gerrit:1191360|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191361|hCaptcha: Fix mock for StatsFactory]], [[gerrit:1191362|NewcomerTasks: Use StatsFactory unit test helper]], [[gerrit:1191350|objectcache: Add a hit/miss flag to CachedB
[13:27:03] <logmsgbot>	 agOStuff]], [[gerrit:1191351|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191341|Fix subref attribute order (T389363)]]
[13:27:11] <stashbot>	 T399195: Update logging and monitoring for multiple session storage backends - https://phabricator.wikimedia.org/T399195
[13:27:12] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:27:13] <stashbot>	 T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363
[13:29:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet
[13:31:49] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin)
[13:32:43] <wikibugs>	 (03PS1) 10DDesouza: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410)
[13:33:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[13:33:33] <logmsgbot>	 !log tgr@deploy1003 d3r1ck01, wmde-fisch, tgr: Backport for [[gerrit:1191359|objectcache: Add a hit/miss flag to CachedBagOStuff]], [[gerrit:1191360|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191361|hCaptcha: Fix mock for StatsFactory]], [[gerrit:1191362|NewcomerTasks: Use StatsFactory unit test helper]], [[gerrit:1191350|objectcache: Add a hit/miss flag to Cache
[13:33:33] <logmsgbot>	 dBagOStuff]], [[gerrit:1191351|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191341|Fix subref attribute order (T389363)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:33:42] <stashbot>	 T399195: Update logging and monitoring for multiple session storage backends - https://phabricator.wikimedia.org/T399195
[13:33:43] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:33:43] <stashbot>	 T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363
[13:33:44] * WMDE-Fisch testing
[13:33:56] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:35:01] <logmsgbot>	 jmm@cumin2002 drain-node (PID 377806) is awaiting input
[13:38:12] <WMDE-Fisch>	 tgr_: I'm fine
[13:38:25] <logmsgbot>	 !log tgr@deploy1003 d3r1ck01, wmde-fisch, tgr: Continuing with sync
[13:38:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214327 (10taavi) `lang=shell-session taavi@puppetserver1001 ~ $ sudo mv /var/lib/puppet/ssl/private_keys/puppetserver1001.eqiad.wmnet.pem /root/puppetserv...
[13:39:50] <wikibugs>	 (03PS2) 10DDesouza: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410)
[13:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:40:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[13:43:12] <logmsgbot>	 !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191359|objectcache: Add a hit/miss flag to CachedBagOStuff]], [[gerrit:1191360|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191361|hCaptcha: Fix mock for StatsFactory]], [[gerrit:1191362|NewcomerTasks: Use StatsFactory unit test helper]], [[gerrit:1191350|objectcache: Add a hit/miss flag to Cached
[13:43:12] <logmsgbot>	 BagOStuff]], [[gerrit:1191351|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191341|Fix subref attribute order (T389363)]] (duration: 16m 09s)
[13:43:20] <stashbot>	 T399195: Update logging and monitoring for multiple session storage backends - https://phabricator.wikimedia.org/T399195
[13:43:21] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:43:22] <stashbot>	 T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363
[13:44:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214358 (10elukey) p:05Unbreak!→03High The above fix worked, really nice save @taavi!  Next steps (imho):  * Create a simple cookbook to clean up certs...
[13:44:39] <wikibugs>	 (03CR) 10Ssingh: lvs1018: remove L2 sub-interface config for row E/F vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney)
[13:44:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214360 (10phaultfinder)
[13:46:02] <tgr_>	 danisztls: should I deploy 1191370 along with the other config change?
[13:46:15] <danisztls>	 tgr_: yes, please
[13:46:23] <danisztls>	 I can also self-deploy if you prefer
[13:46:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[13:46:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza)
[13:47:38] <wikibugs>	 (03Merged) 10jenkins-bot: Enable multibackend session store on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[13:47:46] <wikibugs>	 (03Merged) 10jenkins-bot: Pre-deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza)
[13:47:58] <wikibugs>	 (03PS3) 10DDesouza: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410)
[13:48:12] <logmsgbot>	 !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1191378|Enable multibackend session store on beta and testwiki (T402808)]], [[gerrit:1191370|Pre-deploy Design Research participant recruitment survey on jawiki (T405577)]]
[13:48:20] <stashbot>	 T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577
[13:48:50] <moritzm>	 kubestagemaster1003 and dse-k8s-etcd1002 will go down for a Ganeti node reboot
[13:48:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet
[13:49:01] <wikibugs>	 (03PS1) 10Michael Große: fix: prevent type-error from outdated serialization [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191414 (https://phabricator.wikimedia.org/T405511)
[13:50:24] <icinga-wm>	 PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:50:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: upload_puppet_facts.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:42] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:52:16] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[13:52:28] <wikibugs>	 (03CR) 10Ssingh: haproxy:cache: discard http1.0 requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur)
[13:53:42] <wikibugs>	 (03Merged) 10jenkins-bot: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[13:54:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet
[13:54:44] <logmsgbot>	 !log tgr@deploy1003 tgr, d3r1ck01, dani: Backport for [[gerrit:1191378|Enable multibackend session store on beta and testwiki (T402808)]], [[gerrit:1191370|Pre-deploy Design Research participant recruitment survey on jawiki (T405577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:54:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet
[13:54:50] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:54:51] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:54:52] <stashbot>	 T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577
[13:55:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11214432 (10phaultfinder)
[13:55:28] <icinga-wm>	 RECOVERY - Host kubestagemaster1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.48 ms
[13:55:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye
[13:55:44] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[13:55:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:56:05] <danisztls>	 tgr_: looks good
[13:59:09] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:59:25] <logmsgbot>	 !log tgr@deploy1003 tgr, d3r1ck01, dani: Continuing with sync
[13:59:42] <wikibugs>	 (03CR) 10Fabfur: haproxy:cache: discard http1.0 requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur)
[13:59:50] <wikibugs>	 (03CR) 10Ssingh: "Looks good but let's run varnish tests on this one before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur)
[13:59:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11214445 (10phaultfinder)
[14:00:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet
[14:00:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:04:24] <logmsgbot>	 !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191378|Enable multibackend session store on beta and testwiki (T402808)]], [[gerrit:1191370|Pre-deploy Design Research participant recruitment survey on jawiki (T405577)]] (duration: 16m 11s)
[14:04:32] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[14:04:33] <stashbot>	 T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577
[14:06:13] <tgr_>	 !log UTC afternoon deploys done
[14:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:20] <moritzm>	 ml-etcd1001 will go down for a Ganeti node reboot
[14:06:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet
[14:08:51] <wikibugs>	 (03PS1) 10CDanis: Search inside inline pattern values [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1191424
[14:08:52] <icinga-wm>	 PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:41] <wikibugs>	 (03PS1) 10DDesouza: Deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577)
[14:10:07] <wikibugs>	 (03CR) 10CDanis: [V:03+2 C:03+2] Search inside inline pattern values [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1191424 (owner: 10CDanis)
[14:10:26] <icinga-wm>	 RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[14:10:29] <logmsgbot>	 !log cdanis@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "feat: search inside inline pattern values - cdanis@cumin1003"
[14:10:31] <logmsgbot>	 !log cdanis@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: search inside inline pattern values - cdanis@cumin1003
[14:11:20] <logmsgbot>	 !log cdanis@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: search inside inline pattern values - cdanis@cumin1003
[14:11:21] <logmsgbot>	 !log cdanis@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "feat: search inside inline pattern values - cdanis@cumin1003"
[14:11:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet
[14:11:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet
[14:12:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet
[14:13:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet
[14:14:04] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] haproxy:cache: discard http1.0 requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur)
[14:14:50] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[14:14:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11214574 (10phaultfinder)
[14:15:24] <wikibugs>	 (03CR) 10Bking: opensearch-operator: move WMF-specific values to chart values.yaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190343 (https://phabricator.wikimedia.org/T404906) (owner: 10Bking)
[14:16:02] <phuedx>	 jouncebot nowandnext
[14:16:02] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[14:16:02] <jouncebot>	 In 0 hour(s) and 13 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1430)
[14:16:44] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:16:45] <wikibugs>	 (03PS1) 10DDesouza: Reader foundational on enwiki (beta): Add additional config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410)
[14:17:46] <wikibugs>	 (03PS2) 10Cathal Mooney: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577)
[14:18:37] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] Reader foundational on enwiki (beta): Add additional config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[14:18:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet
[14:18:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet
[14:19:35] <wikibugs>	 (03Merged) 10jenkins-bot: Reader foundational on enwiki (beta): Add additional config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[14:19:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:19:50] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:19:52] <wikibugs>	 (03PS1) 10Slyngshede: P:cache::haproxy add dummy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1191428
[14:19:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214618 (10phaultfinder)
[14:19:58] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:20:38] <wikibugs>	 (03PS1) 10Cathal Mooney: ssw1-d8-eqiad: add bgp peerings to CR and Juniper spines [homer/public] - 10https://gerrit.wikimedia.org/r/1191429 (https://phabricator.wikimedia.org/T396063)
[14:22:19] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5001.eqsin.wmnet with OS trixie
[14:22:34] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11214629 (10Krinkle) >>! In T122097#11214225, @Tgr wrote: >>>! In T122097#11212641, @Krinkle wrote: >> setting/changing...
[14:22:58] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS trixie
[14:24:05] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "Thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1191428 (owner: 10Slyngshede)
[14:24:50] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:24:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214651 (10phaultfinder)
[14:25:35] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy add dummy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1191428 (owner: 10Slyngshede)
[14:26:11] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxy:cache: discard http1.0 requests [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur)
[14:27:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:27:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1430)
[14:30:15] <wikibugs>	 (03PS4) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891)
[14:32:06] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[14:33:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602 (10cmooney) 03NEW p:05Triage→03Medium
[14:37:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp1107.eqiad.wmnet are marked down but pooled: uploadlb_443: Servers cp1107.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:37:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp1107.eqiad.wmnet are marked down but pooled: uploadlb_443: Servers cp1107.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:37:27] <sukhe>	 uh?
[14:37:30] <sukhe>	 fabfur: ^
[14:37:49] <sukhe>	 is anyone working on cp1107?
[14:38:14] <sukhe>	 not really
[14:38:28] <fabfur>	 mmm
[14:38:35] <sukhe>	 fabfur: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/c936de676283b3d9e2ec4af46a82c570f7f98974
[14:38:41] <sukhe>	 can this be related to the above?
[14:38:48] <sukhe>	 I don't see how but maybe the checks?
[14:38:58] <fabfur>	 does pybal issue http1.0 checks???
[14:39:02] <fabfur>	 ok let me revert it
[14:39:09] <sukhe>	 yeah let's revert
[14:39:17] <sukhe>	 and then look
[14:39:19] <sukhe>	 before it gets worse
[14:39:33] <phuedx>	 We do actually have an experimentation-related deployment to do. sukhe, fabfur: Would that be OK or do those alerts block a deployment?
[14:39:46] <wikibugs>	 (03PS1) 10Fabfur: Revert "haproxy:cache: discard http1.0 requests" [puppet] - 10https://gerrit.wikimedia.org/r/1191432
[14:39:58] <sukhe>	 phuedx: if you can wait for five minutes, that might be helpful
[14:39:58] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Revert "haproxy:cache: discard http1.0 requests" [puppet] - 10https://gerrit.wikimedia.org/r/1191432 (owner: 10Fabfur)
[14:40:00] <wikibugs>	 (03CR) 10Fabfur: [V:03+2 C:03+2] Revert "haproxy:cache: discard http1.0 requests" [puppet] - 10https://gerrit.wikimedia.org/r/1191432 (owner: 10Fabfur)
[14:40:01] <sukhe>	 just to rule this out
[14:40:04] <phuedx>	 sukhe: ACK
[14:40:13] <fabfur>	 revert submitted
[14:40:15] <sukhe>	 fabfur: let me know when puppet merge finishes
[14:40:23] <sukhe>	 I will run agent on 1107
[14:40:57] <sukhe>	 it might very well be a red herring but yes, let's not risk this one :]
[14:40:58] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[14:41:04] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:41:31] <fabfur>	 revert merge finished
[14:41:35] <sukhe>	 thanks
[14:41:37] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync
[14:41:44] <fabfur>	 running puppet on A:cp ? 
[14:41:46] <sukhe>	 it would be shocking if we are doing HTTP1.0 there but who knows
[14:41:52] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[14:41:57] <sukhe>	 fabfur: go for it
[14:42:03] <fabfur>	 ack
[14:42:15] <sukhe>	 we should check after we are done merging the revert
[14:42:38] <sukhe>	 !log merging revert for HTTP1.0 discard on cp1107
[14:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:59] <sukhe>	 Sep 25 14:42:54 lvs1020 pybal[1450515]: [uploadlb_443 ProxyFetch] WARN: cp1109.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (https://upload.wikimedia.org/varnish-fe-hc-5ebea9), 0.194 s
[14:43:03] <sukhe>	 yeah
[14:43:06] <sukhe>	 fabfur: roll it out to A:cp
[14:43:13] <fabfur>	 {{doing}}
[14:43:22] <sukhe>	 no batches
[14:43:59] <sukhe>	 it was definitely that 
[14:44:17] <sukhe>	 depool threshold saving the day once again
[14:44:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp1108.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1108.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:44:27] <sukhe>	 phuedx: definitely wait before we roll this out. thank you
[14:44:50] <sukhe>	 fabfur: what's the progress?
[14:44:56] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert)
[14:45:52] <sukhe>	 I still can't believe it was actually HTTP1.0. or we put it in the wrong place but yeah
[14:46:02] <sukhe>	 though varnish won
[14:46:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:46:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:46:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:46:24] <sukhe>	 phew
[14:46:29] <fabfur>	 sukhe: currently 40%
[14:46:33] <sukhe>	 thanks <3
[14:46:57] <sukhe>	 fabfur: lesson for us I guess for next time: the old approach of disabling puppet on A:cp for even trivial changes
[14:46:59] <fabfur>	 I'll open a ticket about this
[14:47:00] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert)
[14:47:04] <sukhe>	 enabling on one and then going ahead
[14:47:05] <fabfur>	 yep
[14:47:10] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:47:17] <sukhe>	 ^ unrelated, this is durum
[14:47:20] <fabfur>	 in that case we also had to wait for a probe to fail
[14:48:01] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[14:48:10] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[14:48:27] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:49:22] <fabfur>	 sukhe: {{done}}
[14:49:30] <sukhe>	 thanks!
[14:49:41] <wikibugs>	 (03PS3) 10Jelto: ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922)
[14:49:48] <sukhe>	 phuedx: go ahead please :)
[14:51:07] <wikibugs>	 (03PS4) 10Jelto: ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922)
[14:51:56] <wikibugs>	 (03CR) 10Jelto: "thanks for the review, replies in the comments" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[14:52:10] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:53:05] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage
[14:53:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet
[14:54:26] <wikibugs>	 (03CR) 10Elukey: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[14:54:40] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[14:54:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11214861 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[14:55:00] <wikibugs>	 (03CR) 10TChin: [C:03+2] [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin)
[14:55:10] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7051/co" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[14:55:23] <wikibugs>	 (03CR) 10MVernon: [C:03+1] ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[14:57:23] <wikibugs>	 (03Merged) 10jenkins-bot: [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin)
[14:58:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.98%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[14:59:32] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Nokia: support mixing of L2 and L3 subinterfaces on SR Linux (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[14:59:36] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage
[14:59:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214908 (10phaultfinder)
[15:00:04] <jouncebot>	 brennen and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1500).
[15:00:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11214910 (10cmooney)
[15:00:27] <logmsgbot>	 jmm@cumin2002 drain-node (PID 416850) is awaiting input
[15:00:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11214911 (10cmooney)
[15:01:33] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy
[15:01:53] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy
[15:02:55] <phuedx>	 sukhe: Belated ACK. Thanks
[15:03:13] <logmsgbot>	 sukhe@cumin1003 reimage (PID 4138186) is awaiting input
[15:03:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214927 (10Jhancock.wm) moved one server to different breaker. holding to see if alert stops going off.
[15:03:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214928 (10Jhancock.wm) a:03Jhancock.wm
[15:04:21] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum7003.magru.wmnet with OS trixie
[15:04:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet
[15:05:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11214933 (10phaultfinder)
[15:05:47] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@5d4a2bb]: deploy phab2002 for T404134
[15:05:53] <stashbot>	 T404134: Merge Phorge's upstream master (2025-09-08) into our wmf/stable - https://phabricator.wikimedia.org/T404134
[15:06:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11214942 (10Jhancock.wm) a:03Jhancock.wm moved one server to different breaker. holding to see if alert stops triggering.
[15:06:28] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@5d4a2bb]: deploy phab2002 for T404134 (duration: 00m 41s)
[15:08:01] <wikibugs>	 (03PS5) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891)
[15:08:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11214950 (10elukey) @Jhancock.wm I cannot reboot the host, tried via console and BMC/Redfish API, it seems stuck in some weird limbo. If you have a moment could you please check it?
[15:09:09] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214952 (10phaultfinder)
[15:10:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet
[15:10:11] <wikibugs>	 (03PS6) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891)
[15:10:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet
[15:10:53] <wikibugs>	 (03CR) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine)
[15:11:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609 (10cmooney) 03NEW p:05Triage→03Medium
[15:11:04] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@5d4a2bb]: deploy phab1004 for T404134
[15:11:10] <stashbot>	 T404134: Merge Phorge's upstream master (2025-09-08) into our wmf/stable - https://phabricator.wikimedia.org/T404134
[15:11:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet
[15:11:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11214968 (10cmooney)
[15:11:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11214967 (10cmooney)
[15:11:59] <wikibugs>	 (03PS3) 10Cathal Mooney: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577)
[15:13:17] <taavi>	 getting 'Unable to load the "Arcanist" library. Put "arcanist/" next to "phorge/" on disk.' from phabricator
[15:13:33] <btullis>	 Same here.
[15:13:39] <Zppix>	 Was just about to report the same
[15:13:56] <andre>	 we are deploying a new Phab version
[15:13:57] <jelto>	 yes Phabricator is getting a new version deploy, it should resolve soon 
[15:13:58] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Nokia: support mixing of L2 and L3 subinterfaces on SR Linux (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[15:14:53] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@5d4a2bb]: deploy phab1004 for T404134 (duration: 03m 49s)
[15:15:22] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage
[15:15:36] <wikibugs>	 (03Merged) 10jenkins-bot: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[15:15:37] <jelto>	 Phabricator should be back :)
[15:16:01] <btullis>	 Thanks jelto .
[15:16:13] <sukhe>	 thanks jelto!
[15:16:27] <jelto>	 mostly brennen and andre :) I was just holding hands
[15:16:30] <Dreamy_Jazz>	 Thanks! Just as an aside I needed to force refresh to get some styles to look normal
[15:17:25] <Dreamy_Jazz>	 Though that may be completely unrelated to this :D
[15:17:44] <brennen>	 Dreamy_Jazz: probably related
[15:18:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet
[15:18:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11214980 (10Jhancock.wm) @bking can you check the site.pp and preseed.yaml files for accuracy? the reimage cookbook is acting like there's a possible misconfig there. thank you!
[15:18:48] <brennen>	 sorry for downtime there all.  slightly longer deploy than standard.
[15:19:14] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage
[15:20:41] <btullis>	 brennen: np and thanks for the new version.
[15:22:08] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[15:22:32] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[15:23:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet
[15:24:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet
[15:28:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11215037 (10Jhancock.wm) @elukey found the server off. i could ping the BMC and login to it. I've powered it back up for you.
[15:28:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215042 (10bking) a:03bking
[15:30:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215048 (10bking) @Jhancock.wm I think we had a similar ticket for the same hardware in EQIAD (T399105) . I'll take a look there and see if we m...
[15:30:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11215053 (10Jhancock.wm)
[15:31:59] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[15:32:57] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[15:32:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device fasw1-f5a-codfw.mgmt.codfw.wmnet
[15:33:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:34:06] <sukhe>	 !log sudo puppet node deactivate durum7003.magru.wmnet: stuck after reimage with failed puppet run
[15:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:50] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11215096 (10phaultfinder)
[15:35:17] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm
[15:38:26] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.0.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191393 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci)
[15:38:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5a-codfw - pt1979@cumin2002"
[15:38:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5a-codfw - pt1979@cumin2002"
[15:38:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:39:09] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:40:12] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploying v1.0.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191393 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci)
[15:41:42] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS trixie
[15:44:16] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:44:43] <wikibugs>	 (03PS5) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490)
[15:44:43] <wikibugs>	 (03PS5) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490)
[15:44:43] <wikibugs>	 (03PS5) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490)
[15:44:50] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:45:38] <hnowlan>	 jouncebot: nowandnext
[15:45:38] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1500)
[15:45:38] <jouncebot>	 In 0 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600)
[15:45:38] <jouncebot>	 In 0 hour(s) and 14 minute(s): Deployment Server Switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600)
[15:47:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:47:11] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine)
[15:48:22] <jasmine_>	 hi folks, just a reminder that we'll be switching the deployment server to codfw shortly
[15:50:32] <wikibugs>	 (03PS1) 10Bking: Add dse-k8s-worker2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1191441 (https://phabricator.wikimedia.org/T399778)
[15:50:34] <brennen>	 noting here that i have a couple of backports to handle before the train, will wait until after deployment server switchover.
[15:51:47] <wikibugs>	 (03PS2) 10Bking: Add dse-k8s-worker2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1191441 (https://phabricator.wikimedia.org/T399778)
[15:54:12] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:54:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11215224 (10phaultfinder)
[15:56:08] <wikibugs>	 (03CR) 10Bking: [C:03+2] "trivial change and blocking DC Ops, so self-merging." [puppet] - 10https://gerrit.wikimedia.org/r/1191441 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking)
[16:00:05] <jouncebot>	 jhathaway and moritzm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:05] <jouncebot>	 jasmine_: That opportune time for a Deployment Server Switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600).
[16:01:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26), 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215247 (10bking) @Jhancock.wm it looks like the host was missing from site.pp. I've added it, and you should be good to g...
[16:01:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26), 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215250 (10bking) a:05bking→03Jhancock.wm
[16:02:45] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618 (10Papaul) 03NEW
[16:04:50] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/srv 2.827% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:04:50] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:04] <hnowlan>	 Please refrain from deploying or otherwise using the deploy servers until the all-clear is given
[16:05:32] <wikibugs>	 (03CR) 10Mstyles: [C:03+2] OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles)
[16:05:56] <wikibugs>	 (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[16:06:19] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[16:06:22] <wikibugs>	 (03Merged) 10jenkins-bot: OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles)
[16:06:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26), 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215269 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2...
[16:07:54] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: Relax leading slash match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368)
[16:08:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rest-gateway: Relax leading slash match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert)
[16:08:09] <wikibugs>	 (03PS2) 10Clément Goubert: rest-gateway: Relax leading slash match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368)
[16:08:45] <wikibugs>	 (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt)
[16:09:47] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw1-f5a-codfw.mgmt.codfw.wmnet
[16:09:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11215277 (10phaultfinder)
[16:09:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11215278 (10phaultfinder)
[16:11:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11215284 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are 3 servers on the EOL list. Two of them have already been replaced but waiting on exte...
[16:13:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device fasw1-f5a-codfw
[16:13:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw1-f5a-codfw
[16:13:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device fasw1-f5b-codfw.mgmt.codfw.wmnet
[16:13:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:15:07] <jasmine_>	 !log stopped spiderpig-apiserver, spiderpig-jobrunner on deploy1003
[16:15:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:03] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11215304 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are at least two servers in this rack that are on the EOL list and can be removed once th...
[16:19:37] <logmsgbot>	 pt1979@cumin2002 provision (PID 457966) is awaiting input
[16:22:45] <logmsgbot>	 !log jasmine@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet,releases1003.eqiad.wmnet with reason: Deployment server switchover
[16:23:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5b-codfw - pt1979@cumin2002"
[16:23:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5b-codfw - pt1979@cumin2002"
[16:23:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:23:44] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11215341 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rack had a huge spike that has settled to threshold since the original switchover. rack has se...
[16:24:16] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine)
[16:25:27] <wikibugs>	 (03PS1) 10Clare Ming: xLab: instrument page visits with delayed events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447
[16:25:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 (owner: 10Clare Ming)
[16:25:55] <logmsgbot>	 !log jasmine@dns1004 START - running authdns-update
[16:26:01] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum7003.magru.wmnet with OS bookworm
[16:26:28] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[16:27:30] <logmsgbot>	 !log jasmine@dns1004 END - running authdns-update
[16:28:24] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] hieradata: update deployment_server to deploy2002 [puppet] - 10https://gerrit.wikimedia.org/r/1190300 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine)
[16:28:28] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti-test2002 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[16:29:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405403#11215380 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there is one server that is EOL and could be decommed. physically marking rack as full and addi...
[16:30:55] <wikibugs>	 (03PS2) 10Slyngshede: P:openldap::management add netbox-readonly-access to offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494)
[16:31:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11215392 (10cmooney)
[16:31:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11215393 (10cmooney)
[16:31:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from deployment.eqiad.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=deployment.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:32:13] <hnowlan>	 uhh almost certainly related to the switchover
[16:32:13] <_joe_>	 hnowlan: tsk tsk
[16:32:14] <hnowlan>	 looking 
[16:32:16] <_joe_>	 ahah yes
[16:32:22] <hnowlan>	 ahh 
[16:32:23] <hnowlan>	 spiderpig :) 
[16:32:26] <claime>	 Ooooh spiderpig
[16:32:27] <_joe_>	 yep
[16:33:09] <hnowlan>	 apologies oncallers!
[16:33:13] <_joe_>	 🤌what the hell is a spider pig? 🤌
[16:33:26] <claime>	 It does whatever a spiderpig does
[16:33:34] <claime>	 Alert silenced
[16:34:02] <jynus>	 _joe_: https://youtu.be/BARjPuUN36Y?si=cDSJfNGYmBWjjnEi&t=26
[16:34:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11215414 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm marking rack as full. adding to main tracking list. there are 4 servers in the EOL list in this...
[16:35:32] <claime>	 !incidents
[16:35:32] <sirenbot>	 6799 (ACKED)  ATSBackendErrorsHigh cache_text sre (deployment.eqiad.wmnet ulsfo)
[16:35:33] <sirenbot>	 6795 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet codfw)
[16:36:15] <jasmine_>	 ty!
[16:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:40:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11215456 (10Jhancock.wm) a:03Jhancock.wm moved one server to a different breaker. holding to see if resolved.
[16:40:44] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:41:18] <logmsgbot>	 !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[16:41:51] <logmsgbot>	 !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[16:46:04] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11215467 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm moved power of one server to different breaker. marking rack as physically full for now. But th...
[16:46:32] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:47:12] <wikibugs>	 (03PS1) 10Santiago Faci: xLab: Deploying v1.0.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191449 (https://phabricator.wikimedia.org/T385180)
[16:52:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623 (10RobH) 03NEW
[16:52:22] <wikibugs>	 06SRE, 05MW-1.45-notes (1.45.0-wmf.21; 2025-09-30), 13Patch-For-Review, 03Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic... - https://phabricator.wikimedia.org/T404204#11215516
[16:52:55] <logmsgbot>	 !log jasmine@deploy2002 Started scap sync-world: Test deployment to validate deployment server switchover - T399891.
[16:53:01] <stashbot>	 T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891
[16:54:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw1-f5b-codfw.mgmt.codfw.wmnet
[16:56:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11215549 (10RobH) @BCornwall,  Congrats, since we've worked together on so many other projects previously I made the #traffic team's host migration tracking task first!  As such, we ma...
[16:57:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11215552 (10RobH)
[16:58:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11215553 (10EMill-WMF) Yes, I can see the dashboards I was intending to now! Thank you very much for everyone who helped resolve my issue, and apologies f...
[16:58:44] <wikibugs>	 (03PS2) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073)
[16:59:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene)
[17:00:05] <jouncebot>	 jasmine_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Deployment Server Switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600).
[17:00:05] <jouncebot>	 bd808: How many deployers does it take to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1700).
[17:00:33] <hnowlan>	 deployment switchover still in progress, nearly done :) 
[17:00:50] <jasmine_>	 ty!
[17:01:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11215627 (10Dzahn) Thanks @EMill-WMF for confirming that. Great!  And no need to apologize. The whole thing was about how that process is confusing even f...
[17:02:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device fasw1-f5b-codfw
[17:02:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw1-f5b-codfw
[17:02:53] <wikibugs>	 (03PS3) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073)
[17:07:18] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Add new grants for dbprov1007 & dbprov2007 backups [puppet] - 10https://gerrit.wikimedia.org/r/1191451 (https://phabricator.wikimedia.org/T403166)
[17:11:39] <wikibugs>	 (03PS1) 10DDesouza: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410)
[17:12:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[17:12:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[17:14:31] <wikibugs>	 (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[17:14:33] <wikibugs>	 (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[17:14:54] <mutante>	 Amir1: thank you for deploying the mariadb template thing the other day. I appreciated that.
[17:15:03] <wikibugs>	 (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[17:15:09] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166)
[17:17:44] <wikibugs>	 (03PS2) 10DDesouza: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410)
[17:18:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628 (10cmooney) 03NEW p:05Triage→03Medium
[17:18:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11215779 (10cmooney)
[17:18:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215780 (10cmooney)
[17:19:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11215792 (10cmooney)
[17:21:45] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "rest-gateway: Tighten non mw-rest-php matches" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191456
[17:21:58] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert "rest-gateway: Tighten non mw-rest-php matches" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191456 (owner: 10Clément Goubert)
[17:23:24] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11215807 (10Dzahn) To be pragmatic.. let's just start with the lowest level of (that type of) access and see in practice if you run into any blockers. Level...
[17:23:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "rest-gateway: Tighten non mw-rest-php matches" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191456 (owner: 10Clément Goubert)
[17:25:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[17:25:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[17:25:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630 (10cmooney) 03NEW p:05Triage→03Medium
[17:26:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215830 (10cmooney)
[17:26:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215831 (10cmooney)
[17:27:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215832 (10cmooney)
[17:27:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215833 (10cmooney)
[17:32:23] <logmsgbot>	 !log jasmine@deploy2002 Finished scap sync-world: Test deployment to validate deployment server switchover - T399891. (duration: 39m 28s)
[17:32:29] <stashbot>	 T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891
[17:33:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632 (10cmooney) 03NEW p:05Triage→03Medium
[17:33:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11215881 (10cmooney)
[17:33:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215882 (10cmooney)
[17:34:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.905s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:34:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:35:00] <wikibugs>	 (03PS1) 10Dzahn: admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129)
[17:36:29] <wikibugs>	 (03PS2) 10Dzahn: admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129)
[17:37:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11215886 (10cmooney)
[17:37:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215890 (10Dzahn) 05Open→03In progress a:03Dzahn
[17:39:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215905 (10cmooney)
[17:39:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.905s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:40:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215913 (10cmooney)
[17:41:19] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215925 (10Dzahn) To be pragmatic I am going with a "one level upgrade" from the lowest to the second lowest level here:  https:...
[17:41:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11215927 (10cmooney)
[17:41:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11215928 (10cmooney)
[17:42:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215944 (10cmooney)
[17:42:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215945 (10cmooney)
[17:42:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215946 (10cmooney)
[17:42:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215947 (10cmooney)
[17:43:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215948 (10Dzahn) @TLessa-WMF Could you take a look at signing L3 while the code change I uploaded is in review? Cheers
[17:43:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215950 (10cmooney)
[17:44:07] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson)
[17:44:27] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:45:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215952 (10Dzahn)
[17:45:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11215953 (10BCornwall) p:05Triage→03Medium
[17:45:16] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.0.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191449 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci)
[17:46:06] <jasmine_>	 update: deployment switchover is now complete :) folks should be able to deploy from deploy2002 now, let us know if you notice anything 
[17:46:32] <mutante>	 congrats jasmine_ 
[17:46:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215958 (10Dzahn)
[17:46:52] <dancy>	 Yay!
[17:47:13] <jasmine_>	 tyty, (and thanks h.nowlan and c.laime for shadowing!)
[17:47:18] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploying v1.0.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191449 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci)
[17:47:21] <mutante>	 fwiw, /srv/patches should have been synced automatically.. for the security guys
[17:49:13] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11215964 (10BTracy-WMF) That sounds like the right path forward. Thank you!
[17:49:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11215963 (10cmooney)
[17:49:43] <brennen>	 jasmine_: ty!
[17:50:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11215967 (10cmooney)
[17:50:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11215968 (10cmooney)
[17:50:19] <brennen>	 jouncebot nowandnext
[17:50:19] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Deployment Server Switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600)
[17:50:19] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1700)
[17:50:19] <jouncebot>	 In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1800)
[17:52:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#11215973 (10cmooney) 05Open→03Declined Gonna close this one for now.  Doing it in our YAML data for the occasional virtual-chassis...
[17:52:51] <brennen>	 going ahead with a couple of backports, then train.
[17:54:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191334 (https://phabricator.wikimedia.org/T405514) (owner: 10Sergio Gimeno)
[17:54:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191414 (https://phabricator.wikimedia.org/T405511) (owner: 10Michael Große)
[17:56:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede)
[17:56:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637 (10cmooney) 03NEW p:05Triage→03Medium
[17:56:47] <logmsgbot>	 !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[17:57:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637#11216041 (10cmooney)
[17:57:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216042 (10cmooney)
[17:57:29] <logmsgbot>	 !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[17:57:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Needs manager approval on task, but patch looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) (owner: 10Dzahn)
[17:59:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11216044 (10Dzahn) @TLessa-WMF We need one more thing. Please get your manager to approve here on this ticket. Thank you
[18:00:07] <jouncebot>	 brennen and dduvall: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1800).
[18:01:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216104 (10Dzahn) Great! I am going ahead.  Meanwhile.. please check the box that you have read https://wikitech.wikimedia.org/wiki/Data_Platform/Data_acce...
[18:02:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640 (10cmooney) 03NEW p:05Triage→03Medium
[18:04:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11216198 (10cmooney)
[18:04:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216197 (10cmooney)
[18:04:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11216208 (10Jhancock.wm)
[18:04:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11216214 (10cmooney)
[18:04:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216215 (10cmooney)
[18:08:31] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[18:09:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11216318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet wi...
[18:09:45] <wikibugs>	 (03Merged) 10jenkins-bot: fix: provide a eventType fallback for already scheduled jobs [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191334 (https://phabricator.wikimedia.org/T405514) (owner: 10Sergio Gimeno)
[18:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: fix: prevent type-error from outdated serialization [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191414 (https://phabricator.wikimedia.org/T405511) (owner: 10Michael Große)
[18:11:03] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm
[18:11:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11216380 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm
[18:11:16] <brennen>	 hmm, unexpected commits in mediawiki-staging
[18:12:37] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "I left some inline comments but they can all be taken as optional/nitpicks/comments. Compiler compiles and seems reasonable:" [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb)
[18:14:17] <mutante>	 brennen: jasmine_: that sounds like one of the rsync timers/services has not run yet or failed
[18:14:32] <brennen>	 yeah, my guess is these have already been deployed.
[18:14:37] <brennen>	 based on timing of patches.
[18:14:41] <wikibugs>	 (03PS1) 10Aklapper: Phabricator: Update recipients of weekly Tech News mail [puppet] - 10https://gerrit.wikimedia.org/r/1191468 (https://phabricator.wikimedia.org/T405638)
[18:14:50] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[18:15:03] <brennen>	 which... probably means it's fine?
[18:15:49] <mutante>	 puppet class deployment::rsync has the stuff that syncs automatically between deployment servers
[18:16:01] <mutante>	 it handles /srv/deployment and /srv/patches
[18:16:14] <mutante>	 maybe staging is not handled 
[18:17:03] <mutante>	 it's possible this should be added (rsync with --delete ?) to keep them identical. 
[18:17:51] <brennen>	 plausible.  my mental model at the moment is that staging was a bit out of date but those changes were fetched down so it should now be in sync.
[18:17:52] <dancy>	 `/usr/local/bin/scap-master-sync` is supposed to do that.  I added that as a step in https://wikitech.wikimedia.org/wiki/Switch_Datacenter/DeploymentServer#Procedure last week
[18:18:16] <mutante>	 what is the rsync host and what is the rsync dest switches with $deployment_server
[18:19:11] <mutante>	 I see. Optionally we could add a path to the puppet class so that the same code handles all.
[18:19:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:19:50] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:20:03] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11216493 (10phaultfinder)
[18:20:28] <mutante>	 brennen: are you trying the scap-master-sync then?
[18:20:40] <wikibugs>	 (03CR) 10Quiddity: [C:03+1] Phabricator: Update recipients of weekly Tech News mail [puppet] - 10https://gerrit.wikimedia.org/r/1191468 (https://phabricator.wikimedia.org/T405638) (owner: 10Aklapper)
[18:20:48] <brennen>	 i'm mid-deploy job on spiderpig
[18:21:12] <dancy>	 Scap runs scap-master-sync of any `scap sync-*` operation.
[18:21:19] <dancy>	 s/of/during/
[18:21:27] <brennen>	 could say no here, but i think it should be fine to proceed under the assumption the 2 extra config changes have already been deployed a while ago.
[18:21:29] <mutante>	 *nod* to both of you. alrighty
[18:23:11] <dancy>	 brennen: I say go
[18:23:13] <wikibugs>	 (03PS1) 10Dzahn: admin: add user btracy with privateadata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366)
[18:23:48] <mutante>	 only 2 changes and it's been verified that they are already deployed.. sounds ok
[18:23:56] <dancy>	 hmm..
[18:24:14] * dancy checks the sync script
[18:24:20] <brennen>	 `tree /srv/patches` looks the same on both boxen
[18:24:22] <mutante>	 well, I guess we jumped from assumption to verified there :P
[18:24:25] <brennen>	 at least for # of files
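Matching file counts, as in the `tree /srv/patches` check, cannot show that two copies are identical: a tree can hold the same number of files with different names or contents. A minimal sketch of a stricter comparison follows, assuming both trees are reachable as local paths. The function names `tree_digest` and `compare_trees` are hypothetical and not part of scap or the puppet-managed sync scripts. To compare two hosts, one could run `tree_digest` on each and diff the results.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(left, right):
    """Return (only_left, only_right, differing) as sorted lists of relative paths."""
    a, b = tree_digest(left), tree_digest(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    # Files present on both sides whose contents hash differently.
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_left, only_right, differing
```

Three empty lists mean the two trees hold the same files with the same contents; anything else names the paths that need a closer look.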
[18:24:34] <brennen>	 mutante: both were merged earlier.
[18:24:34] <logmsgbot>	 !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1191334|fix: provide a eventType fallback for already scheduled jobs (T405514)]], [[gerrit:1191414|fix: prevent type-error from outdated serialization (T405511)]]
[18:24:43] <stashbot>	 T405514: InvalidArgumentException: 'type' parameter is mandatory - https://phabricator.wikimedia.org/T405514
[18:24:43] <stashbot>	 T405511: TypeError: GrowthExperiments\NewcomerTasks\Task\TaskSet::__construct(): Argument #4 ($filters) must be of type GrowthExperiments\NewcomerTasks\Task\TaskSetFilters, array given - https://phabricator.wikimedia.org/T405511
[18:24:50] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:25:13] <mutante>	 yea, /srv/patches is handled by puppet for sure. I once added that.
[18:25:20] <mutante>	 rsync-patches_module.timer
[18:25:38] <mutante>	 systemctl cat rsync-patches_module.service
[18:26:15] <mutante>	 cat /usr/local/sbin/sync-patches_module
[18:26:28] <mutante>	 it has --delete
[18:26:40] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216519 (10BTracy-WMF)
[18:26:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:26:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:27:28] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216524 (10BTracy-WMF) I've read the guidelines and updated this request to reflect. Thanks, again.
[18:27:59] <dancy>	 The scap-master-sync script looks right too
[18:28:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647 (10RobH) 03NEW
[18:29:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216543 (10Dzahn) Great. Thanks! I have uploaded a code change to make it happen and it's in review now.
[18:30:23] <brennen>	 hrm, crap.  maybe these didn't get deployed.  e.g. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1191100 wasn't deployed with `scap backport`.
[18:31:56] <dancy>	 maryum: Are you around?
[18:32:01] <mutante>	 < jinxer-wm> FIRING: SystemdUnitFailed: rsync-srv-patches-releases2003.codfw.wmnet.service on releases2003:9100 
[18:32:12] <mutante>	 releases hosts are also pulling srv/patches from deployment
[18:32:15] <brennen>	 the other one is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1191427
[18:32:35] <mutante>	 and somehow that deployment_server switch and timing with puppet runs or whatever made it fail
[18:32:47] <brennen>	 oh wait, both for beta and not synced, perhaps?
[18:33:23] <dancy>	 brennen: The protocol is that even beta-only changes should be backported (where scap backport itself will short-circuit the process).. but I don't know if that protocol was followed here.
[18:33:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11216609 (10RobH) a:03BCornwall
[18:33:31] <dancy>	 Doesn't seem so
[18:34:05] <taavi>	 1191100 was +2'd directly after the "do not deploy anything, deployment server is being switched over" message :/
[18:34:07] <brennen>	 yeah, beta-only change not using scap backport would explain the 1191427 one.  
[18:34:13] <dancy>	 Two unusual config deployments at a particularly weird time.
[18:34:49] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216626 (10Dzahn) 05Open→03In progress
[18:34:54] <brennen>	 1191100 not so much.
[18:34:59] <taavi>	 probably best to just revert?
[18:35:12] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216632 (10Dzahn) a:03Dzahn
[18:36:13] <wikibugs>	 (03CR) 10Majavah: "Hi, please note that patches need to be pulled down to the deployment server (either with `scap backport` or manually) even if they only t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[18:37:10] <maryum>	 I'm here!
[18:37:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216658 (10Dzahn)
[18:37:22] <wikibugs>	 (03PS4) 10Daniel Kinzler: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440
[18:37:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216659 (10Dzahn) p:05Triage→03High
[18:37:40] <brennen>	 maryum: happen to know if that config change had already been deployed?
[18:37:43] <wikibugs>	 (03PS4) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga)
[18:37:56] <maryum>	 no it hadn't, I forgot I needed to schedule a backport window
[18:38:01] <maryum>	 apologies for breaking things
[18:38:27] <brennen>	 not a huge problem - we can either stop this deploy and do a revert, or go ahead with it if you're ok with it being deployed now
[18:38:29] <maryum>	 how can I help?
[18:38:39] <maryum>	 if you can deploy it now that would be great
[18:38:40] <brennen>	 i see it's just changing a percentage that was already at 10 so it doesn't _seem_ super risky to me?
[18:38:44] <maryum>	 it's not risky
[18:38:51] <brennen>	 ok, let's go ahead with it, if you don't mind waiting around a bit just in case.
[18:38:55] <maryum>	 yeah I'm here
[18:39:02] <brennen>	 cool, thanks all.
[18:39:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga)
[18:40:12] <brennen>	 (sorry for the red herrings re: deployment server switchover.)
[18:40:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) (owner: 10Dzahn)
[18:40:52] <dancy>	 Crisis averted!
[18:44:58] <mutante>	 :)
[18:46:34] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:46:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:53:10] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216717 (10Dzahn) NDA confirmed. Taking care of the LDAP group memberships.  @WMDE-leszek I am not familiar with Superset SQL lab and the `analytics-privatedata-users` group can be confi...
[18:54:17] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191483
[18:54:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216719 (10Dzahn) I am starting out with the lower level of access. .this should take care of dashboards and web logins that also work for other WMDE staff.  We can go from there.
[18:54:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis)
[18:55:47] <mutante>	 !log LDAP - added member: uid=elishacohenwmde,ou=people,dc=wikimedia,dc=org
[18:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:05] <wikibugs>	 (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191483
[18:56:10] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis)
[18:56:22] <mutante>	 !log LDAP - added uid=elishacohenwmde to 'wmde' and 'nda' T404359
[18:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:28] <stashbot>	 T404359: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359
[18:58:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (97.36%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[19:00:39] <wikibugs>	 (03PS1) 10Dzahn: admin: add elishacohenwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1191485 (https://phabricator.wikimedia.org/T404359)
[19:02:28] <wikibugs>	 (03PS1) 10Dduvall: gitlab runners: Allow new buildkit-syntax-forwarder gateway [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651)
[19:03:48] <wikibugs>	 (03PS3) 10CDanis: Export Prometheus metrics for MW primary DC & read only [puppet] - 10https://gerrit.wikimedia.org/r/1191483
[19:04:41] <logmsgbot>	 !log brennen@deploy2002 sgimeno, migr, brennen: Backport for [[gerrit:1191334|fix: provide a eventType fallback for already scheduled jobs (T405514)]], [[gerrit:1191414|fix: prevent type-error from outdated serialization (T405511)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:04:50] <stashbot>	 T405514: InvalidArgumentException: 'type' parameter is mandatory - https://phabricator.wikimedia.org/T405514
[19:04:50] <stashbot>	 T405511: TypeError: GrowthExperiments\NewcomerTasks\Task\TaskSet::__construct(): Argument #4 ($filters) must be of type GrowthExperiments\NewcomerTasks\Task\TaskSetFilters, array given - https://phabricator.wikimedia.org/T405511
[19:04:58] <wikibugs>	 (03PS4) 10CDanis: Export Prometheus metrics for MW primary DC & read only [puppet] - 10https://gerrit.wikimedia.org/r/1191483
[19:05:29] <logmsgbot>	 !log brennen@deploy2002 sgimeno, migr, brennen: Continuing with sync
[19:09:31] <wikibugs>	 (03PS2) 10Dduvall: gitlab runners: Allow new buildkit-syntax-forwarder gateway [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651)
[19:09:59] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405654 (10phaultfinder) 03NEW
[19:11:41] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[19:11:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11216802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with O...
[19:12:01] <brennen>	 hrm.  this is certainly taking its own sweet time.
[19:12:24] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "self-merging this because I already did the LDAP groups.. follow-up with an upgrade to analytics-privatedata-users tbd" [puppet] - 10https://gerrit.wikimedia.org/r/1191485 (https://phabricator.wikimedia.org/T404359) (owner: 10Dzahn)
[19:15:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216808 (10Dzahn) @WMDE-leszek  @ECohen_WMDE  All the things tied to the "wmde"/"nda" LDAP groups (ability to merge in WMDE repos, web logins,..) should already wor...
[19:15:47] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11216812 (10Jhancock.wm) 05Open→03Resolved
[19:16:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11216814 (10Jhancock.wm) 05Open→03Resolved
[19:18:43] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405654#11216834 (10Jhancock.wm) 05Open→03Invalid
[19:19:03] <logmsgbot>	 !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191334|fix: provide a eventType fallback for already scheduled jobs (T405514)]], [[gerrit:1191414|fix: prevent type-error from outdated serialization (T405511)]] (duration: 54m 29s)
[19:19:13] <stashbot>	 T405514: InvalidArgumentException: 'type' parameter is mandatory - https://phabricator.wikimedia.org/T405514
[19:19:14] <stashbot>	 T405511: TypeError: GrowthExperiments\NewcomerTasks\Task\TaskSet::__construct(): Argument #4 ($filters) must be of type GrowthExperiments\NewcomerTasks\Task\TaskSetFilters, array given - https://phabricator.wikimedia.org/T405511
[19:19:43] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216840 (10Dzahn) a:03Dzahn
[19:19:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216842 (10EBomani) Thank you so much, @thcipriani, @Dzahn and @Aklapper!  Also about changing the email, I went to the link (and even through my settings) but was unable to update it. My ema...
[19:20:22] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191491 (https://phabricator.wikimedia.org/T396381)
[19:20:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191491 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot)
[19:21:17] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216865 (10EBomani) Actually, might also be the case that my other username on here (accidentally created when I was no longer a contractor) is the issue.
[19:21:24] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191491 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot)
[19:24:22] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[19:24:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216875 (10Dzahn) Ah, there are 2 users indeed!   ` MariaDB [phabricator_user]> SELECT u.userName, ue.address, ue.isPrimary FROM phabricator_user.user u JOIN phabricator_user.user_email ue WH...
[19:29:49] <logmsgbot>	 jhathaway@cumin2002 reimage (PID 549667) is awaiting input
[19:30:21] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie
[19:31:21] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm
[19:31:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11216902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm executed with errors: -...
[19:33:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:35:14] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.20  refs T396381
[19:35:17] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216920 (10Dzahn)
[19:35:20] <stashbot>	 T396381: 1.45.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T396381
[19:36:23] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:36:36] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:42:00] <wikibugs>	 (03PS4) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[19:42:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[19:43:34] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216931 (10Dzahn) Hey @EBomani For a moment, don't worry about the 2 Phabricator users. We can deal with this but treat the actual deployment access separately.  Most boxes are checked. You h...
[19:44:50] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:45:32] <wikibugs>	 (03PS5) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[19:46:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[19:48:10] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216953 (10Dzahn) @EBomani I am sending you an email to your new, non-contractor email account. Please take a look.
[19:49:57] <jinxer-wm>	 RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:52:28] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Phabricator: Update recipients of weekly Tech News mail [puppet] - 10https://gerrit.wikimedia.org/r/1191468 (https://phabricator.wikimedia.org/T405638) (owner: 10Aklapper)
[19:53:29] <wikibugs>	 (03PS6) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[19:54:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[19:55:14] <stephanebisson>	 jouncebot now
[19:55:14] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1800)
[19:56:12] <brennen>	 train ops have wrapped up and things seem stable.
[19:56:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:56:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:57:12] <stephanebisson>	 Mind if we start the backport window a few minutes early?
[19:57:51] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve matching for users renamed multiple times [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177)
[19:59:27] <wikibugs>	 (03CR) 10Dzahn: "rebasing so it can get done without waiting for the other request that still needs approval" [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) (owner: 10Dzahn)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T2000).
[20:00:05] <jouncebot>	 danisztls, cjming, and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <stephanebisson>	 o/
[20:00:24] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[20:00:45] <wikibugs>	 (03PS7) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[20:00:48] <danisztls>	 o/ I can self-deploy
[20:01:24] <stephanebisson>	 danisztls do you mind if I start? I know you're before me in the queue, but I have a bit of an urgent situation.
[20:01:25] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] "VERY COOL." [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis)
[20:01:27] <danisztls>	 cjming: you can go ahead of me, I'm doing a small change on one of my patches
[20:01:33] <danisztls>	 stephanebisson: I don't mind
[20:01:43] <stephanebisson>	 danisztls thanks!
[20:02:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[20:03:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson)
[20:03:49] <cjming>	 hi ! cool thanks!
[20:03:59] <wikibugs>	 (03Merged) 10jenkins-bot: Special:Contribute: configure new page target title for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson)
[20:04:19] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1191377|Special:Contribute: configure new page target title for enwiki (T327063)]]
[20:04:26] <stashbot>	 T327063: Adjust "New page" option of the Contribute options to point to a community page when it exists - https://phabricator.wikimedia.org/T327063
[20:04:45] <MatmaRex>	 i added a maintenance script run to the window, i'd appreciate if someone could start it for me once you're done with the more important deployments
[20:04:50] <jinxer-wm>	 FIRING: DiskSpace: Disk space deploy1003:9100:/srv 2.826% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:04:50] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:57] <wikibugs>	 (03PS2) 10Dzahn: admin: add user btracy with privateadata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366)
[20:05:18] <mutante>	 the Disk space alert on deploy1003 will also be somehow related to the deployment server switch
[20:05:31] <wikibugs>	 (03PS3) 10DDesouza: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410)
[20:05:41] <cjming>	 danisztls: lmk when you're done after stephanebisson finishes
[20:06:28] <cjming>	 and if you're not ready by then, i can do my backport
[20:06:34] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:06:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:07:52] <mutante>	 brennen: jnuche: deploy1003 is somehow at 98% usage on /srv. so the deployment server that is not active anymore since today..  it's using more than twice the space of deploy2002 
[20:08:44] <mutante>	 suddenly alerted while deploy2002 is in use.. dunno yet if just gradual build up or something happened just now
[20:09:34] <mutante>	 scap-master-sync possibly related since as you said earlier it runs each time with scap ?
[20:09:44] <brennen>	 mutante: well that's... weird.
[20:10:28] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1191377|Special:Contribute: configure new page target title for enwiki (T327063)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:10:34] <stashbot>	 T327063: Adjust "New page" option of the Contribute options to point to a community page when it exists - https://phabricator.wikimedia.org/T327063
[20:10:49] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Continuing with sync
[20:12:15] <wikibugs>	 (03PS1) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[20:12:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[20:14:55] <mutante>	 brennen: it's all about /srv/docker
[20:14:57] <wikibugs>	 (03PS8) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[20:15:14] <brennen>	 that was my guess
[20:15:50] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191377|Special:Contribute: configure new page target title for enwiki (T327063)]] (duration: 11m 31s)
[20:15:57] <stashbot>	 T327063: Adjust "New page" option of the Contribute options to point to a community page when it exists - https://phabricator.wikimedia.org/T327063
[20:15:58] <mutante>	 https://phabricator.wikimedia.org/T401647
[20:16:19] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis)
[20:16:20] <wikibugs>	 (03PS2) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[20:16:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[20:16:32] <mutante>	 !log deploy1003 alerted because /srv/ is at 98% - T401647
[20:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:38] <stashbot>	 T401647: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647
[20:16:43] <stephanebisson>	 danisztls, cjming I'm done. Sorry for jumping the queue
[20:16:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[20:16:56] <danisztls>	 stephanebisson: no problem, I already started my deploy
[20:17:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[20:17:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza)
[20:17:38] <cjming>	 stephanebisson: thanks!
[20:17:46] <cjming>	 danisztls: do you want to go next?
[20:17:50] <wikibugs>	 (03PS9) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[20:17:53] <wikibugs>	 (03Merged) 10jenkins-bot: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[20:18:02] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza)
[20:18:25] <logmsgbot>	 !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1191454|Pre-deploy reader foundational survey on enwiki (T405410)]], [[gerrit:1191425|Deploy Design Research participant recruitment survey on jawiki (T405577)]]
[20:18:33] <stashbot>	 T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410
[20:18:33] <stashbot>	 T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577
[20:18:45] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Export Prometheus metrics for MW primary DC & read only [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis)
[20:20:09] <wikibugs>	 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11217099 (10Dzahn) The alert happened during one of the first deploys after `deploy2002` became the active deployment server today.  Seeing the other deployment server alert suddenly made me look. Thinking it wa...
[20:20:37] <wikibugs>	 (03PS2) 10BCornwall: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[20:21:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11217124 (10Papaul)
[20:24:34] <logmsgbot>	 !log dani@deploy2002 dani: Backport for [[gerrit:1191454|Pre-deploy reader foundational survey on enwiki (T405410)]], [[gerrit:1191425|Deploy Design Research participant recruitment survey on jawiki (T405577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:24:43] <wikibugs>	 06SRE, 06serviceops-radar, 06Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#11217157 (10Dzahn) We got a new alert today. It was deploy1003 at 98% on /srv/. Still about /srv/docker.
[20:24:43] <stashbot>	 T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410
[20:24:43] <stashbot>	 T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577
[20:26:02] <logmsgbot>	 !log dani@deploy2002 dani: Continuing with sync
[20:26:03] <wikibugs>	 (03PS1) 10CDanis: Mediawiki Etcd Prometheus: use datacenter label [puppet] - 10https://gerrit.wikimedia.org/r/1191501
[20:26:34] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Mediawiki Etcd Prometheus: use datacenter label [puppet] - 10https://gerrit.wikimedia.org/r/1191501 (owner: 10CDanis)
[20:28:49] <wikibugs>	 (03PS3) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[20:28:53] <danisztls>	 cjming: all yours
[20:29:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[20:29:58] <wikibugs>	 (03PS1) 10Krinkle: [WIP] varnish: No-op for CI [puppet] - 10https://gerrit.wikimedia.org/r/1191502
[20:31:00] <logmsgbot>	 !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191454|Pre-deploy reader foundational survey on enwiki (T405410)]], [[gerrit:1191425|Deploy Design Research participant recruitment survey on jawiki (T405577)]] (duration: 12m 35s)
[20:31:09] <stashbot>	 T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410
[20:31:09] <stashbot>	 T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577
[20:31:44] <cjming>	 ty!
[20:32:19] <wikibugs>	 (03PS2) 10Krinkle: [WIP] varnish: No-op for CI [puppet] - 10https://gerrit.wikimedia.org/r/1191502
[20:32:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 (owner: 10Clare Ming)
[20:34:37] <mutante>	 !log [releases2003:~] $ sudo systemctl reset-failed - monitoring alerted about failed rsync from deploy1003 after active deployment server switched to deploy2002 today - T405646
[20:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:43] <stashbot>	 T405646: SystemdUnitFailed - rsync on releases2003 - https://phabricator.wikimedia.org/T405646
[20:35:13] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[20:35:49] <wikibugs>	 (03CR) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[20:36:33] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504
[20:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:37:51] <wikibugs>	 (03PS2) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504
[20:38:55] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[20:39:03] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: add user btracy with privateadata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) (owner: 10Dzahn)
[20:40:43] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: instrument page visits with delayed events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 (owner: 10Clare Ming)
[20:40:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11217235 (10Dzahn) Hey @BTracy-WMF You have been added to the `analytics-privatedata-users` group as requested.  Let us know if you ne...
[20:41:03] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1191447|xLab: instrument page visits with delayed events]]
[20:43:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11217244 (10Dzahn) 05In progress→03Resolved You can try out Superset now.
[20:44:05] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[20:44:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.499s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:44:29] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[20:44:54] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11217247 (10Dzahn)
[20:46:35] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[20:47:00] <wikibugs>	 (03PS1) 10TChin: [eventgate_*] Bump eventgate to v1.25.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191506 (https://phabricator.wikimedia.org/T403169)
[20:47:03] <logmsgbot>	 !log cjming@deploy2002 cjming: Backport for [[gerrit:1191447|xLab: instrument page visits with delayed events]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:47:19] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[20:47:27] <logmsgbot>	 !log cjming@deploy2002 cjming: Continuing with sync
[20:48:36] <wikibugs>	 (03Abandoned) 10Krinkle: [WIP] varnish: No-op for CI [puppet] - 10https://gerrit.wikimedia.org/r/1191502 (owner: 10Krinkle)
[20:48:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11217261 (10Dzahn) a:05Dzahn→03cmadeo Hello @cmadeo Dayforce says you are the manager of @TLessa-WMF and to complete this acc...
[20:49:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int releases routed via main (k8s) 1.04s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:49:47] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217265 (10Dzahn) a:05Dzahn→03EBomani
[20:51:09] <wikibugs>	 (03PS4) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[20:52:20] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191447|xLab: instrument page visits with delayed events]] (duration: 11m 17s)
[20:52:49] <cjming>	 MatmaRex: all yours!
[20:53:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[20:53:27] <MatmaRex>	 thanks. i don't have deployment access, is anyone around who could deploy my thing?
[20:53:36] <MatmaRex>	 eh, i guess it's almost the end of the window already
[20:53:39] <wikibugs>	 (03PS1) 10Dzahn: admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359)
[20:53:41] <MatmaRex>	 i'll schedule it for monday :)
[20:53:44] <cjming>	 oh! i can do it
[20:53:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) (owner: 10Dzahn)
[20:54:24] <cjming>	 MatmaRex: want me to deploy your stuff?
[20:54:27] <MatmaRex>	 cjming: thanks, i think i'll reschedule it for monday, i don't want to sit here until late myself either
[20:54:32] <wikibugs>	 (03PS2) 10Dzahn: admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359)
[20:54:41] <cjming>	 alrighty
[20:54:57] <MatmaRex>	 thanks for the offer
[20:55:36] <wikibugs>	 (03PS3) 10Dzahn: admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359)
[20:55:52] <cjming>	 !log end of UTC late backport window
[20:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T2100)
[21:01:53] <wikibugs>	 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11217325 (10jhathaway) >>! In T404356#11209467, @MatthewVernon wrote: > Does `/boot` even need to be on a separate partition for UEFI...
[21:02:42] <wikibugs>	 (03PS3) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504
[21:04:54] <wikibugs>	 (03PS4) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510)
[21:05:03] <wikibugs>	 (03PS5) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510)
[21:05:43] <wikibugs>	 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11217341 (10jhathaway) >>! In T404356#11184299, @elukey wrote: > The host doesn't PXE/HTTP boot for some reason, I reopened the provi...
[21:07:45] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] "That makes sense. Sorry about that. I was under the wrong impression that patches to labs bypassed the normal backporting process. Next ti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza)
[21:09:48] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[21:10:40] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Docker
[21:12:30] <wikibugs>	 (03PS1) 10DDesouza: Deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410)
[21:18:20] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) (owner: 10RLazarus)
[21:20:01] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) (owner: 10RLazarus)
[21:21:44] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[21:21:53] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[21:23:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.566s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:24:31] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[21:24:50] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[21:24:50] <wikibugs>	 (03PS2) 10DDesouza: Deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410)
[21:25:27] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:27:51] <logmsgbot>	 !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772
[21:27:57] <stashbot>	 T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772
[21:28:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.566s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:29:15] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[21:29:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.625s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:29:42] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[21:32:54] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772)
[21:33:30] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772)
[21:33:46] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper)
[21:34:09] <wikibugs>	 (03CR) 10Bking: [C:03+1] wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper)
[21:34:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.625s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:34:39] <wikibugs>	 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11217377 (10RLazarus) 05Open→03Resolved 1.23 is gone. 🎉
[21:37:19] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772)
[21:39:25] <wikibugs>	 (03PS1) 10Krinkle: Disable inert MobileFrontend on misc wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882)
[21:40:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Disable inert MobileFrontend on misc wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle)
[21:40:37] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper)
[21:42:28] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper)
[21:43:16] <logmsgbot>	 !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772
[21:43:23] <stashbot>	 T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772
[21:50:20] <wikibugs>	 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217414 (10Krinkle)
[21:50:28] <wikibugs>	 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101#11217420 (10RLazarus) 05Open→03Resolved
[21:50:57] <wikibugs>	 (03PS2) 10Krinkle: Disable inert MobileFrontend on misc wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882)
[21:52:05] <wikibugs>	 (03PS3) 10Krinkle: Disable inert MobileFrontend on wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882)
[21:53:23] <wikibugs>	 (03PS4) 10Krinkle: Disable inert MobileFrontend on wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882)
[21:54:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle)
[21:55:35] <wikibugs>	 (03Merged) 10jenkins-bot: Disable inert MobileFrontend on wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle)
[21:55:57] <logmsgbot>	 !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1191514|Disable inert MobileFrontend on wikimedia.org wikis that lack DNS (T152882)]]
[21:56:03] <stashbot>	 T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882
[22:02:09] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1191514|Disable inert MobileFrontend on wikimedia.org wikis that lack DNS (T152882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:02:15] <stashbot>	 T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882
[22:04:41] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Continuing with sync
[22:05:09] <wikibugs>	 (03PS6) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510)
[22:07:56] <wikibugs>	 (03PS1) 10RLazarus: mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663)
[22:07:59] <wikibugs>	 (03PS1) 10RLazarus: mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663)
[22:09:49] <logmsgbot>	 !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191514|Disable inert MobileFrontend on wikimedia.org wikis that lack DNS (T152882)]] (duration: 13m 52s)
[22:09:56] <stashbot>	 T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882
[22:14:50] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[22:16:20] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: shift old full graph hosts to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772)
[22:17:45] <wikibugs>	 (03PS1) 10RLazarus: kubernetes: Set default Envoy version to 1.29.12 [puppet] - 10https://gerrit.wikimedia.org/r/1191526 (https://phabricator.wikimedia.org/T403663)
[22:18:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11217559 (10TLessa-WMF) @Dzahn document signed, thank you so much for your help!   @cmadeo for context here, I am trying to be ab...
[22:21:55] <wikibugs>	 (03CR) 10RLazarus: "(Stacking this up for after the two MW patches in the charts repo, per Depends-On.)" [puppet] - 10https://gerrit.wikimedia.org/r/1191526 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[22:24:50] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:49:02] <wikibugs>	 (03PS7) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510)
[22:52:06] <wikibugs>	 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217641 (10Krinkle)
[22:58:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:59:25] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (98%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[23:02:14] <wikibugs>	 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217659 (10Krinkle) I was originally going to enable unified mobile routing on login.wikimedia.org today, as part of the misc wikimedia.org batch at T403510. Ho...
[23:03:54] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus)
[23:05:46] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus)
[23:08:12] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply
[23:08:25] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[23:10:03] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[23:10:10] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[23:25:44] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217708 (10EBomani) Hello @Dzahn, got your Email and sent over a verification response. Thanks for getting to that so swiftly :))   @thcipriani and I are going to meet next week for the next...
[23:29:08] <mutante>	 !log releases2003 - re-enabling puppet which was disabled for debugging T405352 - then the deployment server failover happened and this server didn't get the update about which deployment server was active.. which subsequently caused T405646
[23:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:17] <stashbot>	 T405352: APT error when installing Jenkins package in releases instances - https://phabricator.wikimedia.org/T405352
[23:29:17] <stashbot>	 T405646: SystemdUnitFailed - rsync on releases2003 - https://phabricator.wikimedia.org/T405646
[23:38:25] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1191537
[23:38:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1191537 (owner: 10TrainBranchBot)
[23:44:50] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:57:02] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1191537 (owner: 10TrainBranchBot)
[23:59:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217753 (10Dzahn) @EBomani Thank you. I received your response. We can check that box as well :)  I will make a patch tomorrow and next week's clinic duty person can merge it.
[23:59:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217754 (10Dzahn)
[23:59:47] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217755 (10Dzahn) a:05EBomani→03Dzahn