[00:29:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:30:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:31:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:34:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:36:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:36:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:39:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:40:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:40:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:41:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:45:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:50:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:51:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:00:04] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268295 (owner: TrainBranchBot)
[01:05:25] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1023-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:09:50] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268698
[01:09:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268698 (owner: TrainBranchBot)
[01:15:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:57] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268698 (owner: TrainBranchBot)
[02:00:55] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:07] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 12s)
[02:09:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:15] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:35:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:41:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:41:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:44:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:44:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:45:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:48:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:54:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:56:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:59:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:59:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:00:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:02:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:35:25] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1023-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:37:08] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11797531 (RobH) I'll put a more detailed timeline and update tomorrow but as it stands now:  * unisys engineer showed up at 10am singapore time * swapped mainboard, damaged the CPU bracket and mainb...
[04:41:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:10:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:13:08] <wikibugs>	 (PS1) Marostegui: db1152: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268708 (https://phabricator.wikimedia.org/T418561)
[05:13:41] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Maintenance
[05:13:43] <wikibugs>	 (CR) Marostegui: [C:+2] db1152: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268708 (https://phabricator.wikimedia.org/T418561) (owner: Marostegui)
[05:15:32] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1152: Reimage
[05:15:32] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:15:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:15:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1152: Reimage
[05:15:59] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1152.eqiad.wmnet with OS trixie
[05:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:33] <wikibugs>	 (PS1) Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520)
[05:23:23] <wikibugs>	 (PS2) Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520)
[05:24:10] <wikibugs>	 (CR) CI reject: [V:-1] cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: Anzx)
[05:29:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage
[05:33:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage
[05:36:08] <wikibugs>	 (PS3) Ayounsi: eqsin routed ganeti: initial setup [puppet] - https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863)
[05:36:16] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[05:41:13] <wikibugs>	 (CR) Ayounsi: [C:+2] eqsin routed ganeti: initial setup [puppet] - https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[05:43:13] <wikibugs>	 (PS3) Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520)
[05:50:12] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1152.eqiad.wmnet with OS trixie
[05:52:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:53:17] <wikibugs>	 (PS4) Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520)
[05:53:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11797619 (ayounsi)
[05:57:14] <wikibugs>	 (PS1) Marostegui: Revert "db1152: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268832
[05:57:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:57:58] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1152: After reimage
[05:57:58] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:58:00] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1152: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268832 (owner: Marostegui)
[05:58:13] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:58:13] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1152: After reimage
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T0600)
[06:11:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:12:00] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti5007 a routed Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1268834 (https://phabricator.wikimedia.org/T421863)
[06:15:09] <wikibugs>	 (PS2) Anzx: wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770)
[06:15:15] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp6016 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:16:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[06:16:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) (owner: Anzx)
[06:16:15] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp6016 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:16:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: Anzx)
[06:40:40] <wikibugs>	 (PS1) Hashar: ci: enhance ci-build-images script [puppet] - https://gerrit.wikimedia.org/r/1268594 (https://phabricator.wikimedia.org/T422488)
[06:59:25] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11797721 (MoritzMuehlenhoff) @MMigurski-WMF The developer account needs to be linked to your @wikimedia.org email address. Please log into https://idm.wikimedia.org/ and then unde...
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T0700).
[07:00:05] <jouncebot>	 WMDE-Fisch and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] <anzx>	 o/
[07:00:30] <WMDE-Fisch>	 o/
[07:01:05] <WMDE-Fisch>	 I could self-serve
[07:01:40] <WMDE-Fisch>	 Starting with my change now.
[07:01:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: Krinkle)
[07:04:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: WMDE-Fisch)
[07:05:49] <wikibugs>	 (Merged) jenkins-bot: Enable sub-references on Czech and Italian wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: WMDE-Fisch)
[07:06:45] <logmsgbot>	 !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1268514|Enable sub-references on Czech and Italian wiki (T420938)]]
[07:06:49] <stashbot>	 T420938: Deploy Sub-referencing to itwiki and cswiki - https://phabricator.wikimedia.org/T420938
[07:08:42] <logmsgbot>	 !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1268514|Enable sub-references on Czech and Italian wiki (T420938)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:11:14] <logmsgbot>	 !log wmde-fisch@deploy1003 wmde-fisch: Continuing with sync
[07:13:16] <wikibugs>	 sre-alert-triage, Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422581 (LSobanski) NEW
[07:13:21] <wikibugs>	 sre-alert-triage, Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422582 (LSobanski) NEW
[07:14:12] <wikibugs>	 sre-alert-triage, Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422581#11797758 (LSobanski) Considering this is a critical alert that has been firing for a month, should it be downgraded or removed?
[07:14:23] <wikibugs>	 sre-alert-triage, Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422582#11797759 (LSobanski) Considering this is a critical alert that has been firing for a month, should it be downgraded or removed?
[07:15:29] <logmsgbot>	 !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268514|Enable sub-references on Czech and Italian wiki (T420938)]] (duration: 08m 44s)
[07:15:32] <stashbot>	 T420938: Deploy Sub-referencing to itwiki and cswiki - https://phabricator.wikimedia.org/T420938
[07:16:20] <WMDE-Fisch>	 I'm done. 
[07:18:24] <Krinkle>	 I could self-serve as well, but anzx was first.
[07:18:45] <anzx>	 need someone to deploy for me 
[07:19:05] <moritzm>	 !log installing openssl security updates
[07:19:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] <WMDE-Fisch>	 anzx:  I could do your patches as well, but I have no clue what they are doing exactly and there's no +1 from anybody else 🤔
[07:19:12] <WMDE-Fisch>	 Ahh
[07:21:15] <wikibugs>	 SRE, Pywikibot, Traffic, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11797798 (matej_suchanek) >>! In T421642#11785461, @Xqt wrote: > The problems began on March 25th: > {F74901675}  Please (re)attach the file, so that it's visible if i...
[07:21:22] <wikibugs>	 (CR) Ayounsi: [C:+1] Make ganeti5007 a routed Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1268834 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:23:44] <anzx>	 WMDE-Fisch , i can schedule it for later if you couldn't deploy 
[07:24:45] <WMDE-Fisch>	 I'm more confident with the Wikimania wiki part. I'll do that then you already have half of the job done ;-)
[07:24:51] <wikibugs>	 (CR) Ayounsi: "Using RIPE Atlas:" [dns] - https://gerrit.wikimedia.org/r/1267042 (owner: Ayounsi)
[07:25:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) (owner: Anzx)
[07:25:32] <anzx>	 WMDE-Fisch: ok
[07:26:13] <wikibugs>	 (Merged) jenkins-bot: wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) (owner: Anzx)
[07:26:37] <logmsgbot>	 !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1268833|wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup (T421770)]]
[07:26:41] <stashbot>	 T421770: wikimaniawiki: add editsemiprotected to extendedconfirmed group - https://phabricator.wikimedia.org/T421770
[07:28:26] <logmsgbot>	 !log wmde-fisch@deploy1003 wmde-fisch, anzx: Backport for [[gerrit:1268833|wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup (T421770)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:28:30] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/aux-codfw: maintenance
[07:28:31] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/aux-codfw: maintenance
[07:28:32] <WMDE-Fisch>	 anzx: Want to test anything with that patch?
[07:29:09] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: disable connection re-use [puppet] - https://gerrit.wikimedia.org/r/1268557 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb)
[07:29:15] <anzx>	 WMDE-Fisch: looks good to sync
[07:29:19] <logmsgbot>	 !log wmde-fisch@deploy1003 wmde-fisch, anzx: Continuing with sync
[07:33:31] <logmsgbot>	 !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268833|wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup (T421770)]] (duration: 06m 54s)
[07:33:35] <stashbot>	 T421770: wikimaniawiki: add editsemiprotected to extendedconfirmed group - https://phabricator.wikimedia.org/T421770
[07:33:50] <WMDE-Fisch>	 anzx:  Done. :-)
[07:33:56] <anzx>	 thank you 
[07:33:59] <wikibugs>	 (PS3) Majavah: hieradata: service: Add dumps services [puppet] - https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040)
[07:33:59] <wikibugs>	 (PS3) Majavah: O:dumps::distribution::server: Configure as LVS realserver [puppet] - https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040)
[07:33:59] <wikibugs>	 (PS3) Majavah: hieradata: Move dumps to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040)
[07:34:00] <wikibugs>	 (PS4) Majavah: hieradata: Move dumps to production [puppet] - https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040)
[07:34:03] <WMDE-Fisch>	 Krinkle:  You can go on then. 
[07:34:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: Krinkle)
[07:35:22] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: service: Add dumps services [puppet] - https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[07:35:35] <wikibugs>	 (CR) Majavah: [C:+2] O:dumps::distribution::server: Configure as LVS realserver [puppet] - https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[07:35:57] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[07:36:03] <wikibugs>	 (Merged) jenkins-bot: Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: Krinkle)
[07:36:29] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]]
[07:36:32] <stashbot>	 T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338
[07:38:18] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:40:39] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin routed ganeti IPs - ayounsi@cumin1003"
[07:40:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin routed ganeti IPs - ayounsi@cumin1003"
[07:40:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:40:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:41:54] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[07:46:04] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] (duration: 09m 34s)
[07:46:07] <stashbot>	 T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338
[07:48:14] <wikibugs>	 (CR) Majavah: [C:+2] "looks good: https://phabricator.wikimedia.org/P90323" [puppet] - https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[07:48:26] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/aux-codfw: maintenance
[07:48:26] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/aux-codfw: maintenance
[07:53:00] <wikibugs>	 (PS1) Slyngshede: CSS: Improve footer on mobile [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1268895 (https://phabricator.wikimedia.org/T422203)
[07:54:26] <wikibugs>	 (PS1) Elukey: service: allow k8s-ingress-aux to be depooled [puppet] - https://gerrit.wikimedia.org/r/1268896 (https://phabricator.wikimedia.org/T414486)
[07:55:04] <wikibugs>	 (CR) JMeybohm: [C:+1] service: allow k8s-ingress-aux to be depooled [puppet] - https://gerrit.wikimedia.org/r/1268896 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[07:55:49] <wikibugs>	 (CR) Elukey: [C:+2] service: allow k8s-ingress-aux to be depooled [puppet] - https://gerrit.wikimedia.org/r/1268896 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[07:55:55] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make ganeti5007 a routed Ganeti node [puppet] - https://gerrit.wikimedia.org/r/1268834 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff)
[07:56:29] <moritzm>	 elukey: okay to merge your ingress change along now?
[07:56:38] <elukey>	 +1
[07:57:42] <moritzm>	 and merged
[08:00:05] <jouncebot>	 dancy and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T0800).
[08:01:22] <wikibugs>	 (CR) ArielGlenn: "one question, couple typos noted" [deployment-charts] - https://gerrit.wikimedia.org/r/1255731 (owner: Daniel Kinzler)
[08:02:37] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/aux-codfw: maintenance
[08:03:12] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/aux-codfw: maintenance
[08:04:22] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster aux-codfw: Kubernetes upgrade
[08:04:31] <wikibugs>	 ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11797895 (jcrespo) >>! In T419970#11795620, @Jhancock.wm wrote: > @jcrespo would loading the disks from a foreign config be acceptable for...
[08:06:25] <wikibugs>	 (CR) Elukey: [C:+2] Add istio 1.24 config for k8s-aux [deployment-charts] - https://gerrit.wikimedia.org/r/1267568 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[08:06:39] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[08:06:55] <wikibugs>	 (CR) Elukey: [C:+2] Upgrade aux-k8s-codfw to k8s 1.31 [puppet] - https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[08:08:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl2003.codfw.wmnet are marked down but pooled: k8s-ingress-aux_30443: Servers aux-k8s-worker2003.codfw.wmnet, aux-k8s-worker2005.codfw.wmnet, aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:08:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl2003.codfw.wmnet are marked down but pooled: k8s-ingress-aux_30443: Servers aux-k8s-worker2003.codfw.wmnet, aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2004.codfw.wmnet, aux-k8s-worker2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:08:47] <elukey>	 this is me --^
[08:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti5007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[08:15:06] <wikibugs>	 (PS3) Fabfur: hiera: upgrade haproxy to version 3.2 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402)
[08:15:12] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[08:16:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:17:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 03 Jun 2026 06:56:12 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:18:40] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1268895 (https://phabricator.wikimedia.org/T422203) (owner: Slyngshede)
[08:19:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[08:20:18] <logmsgbot>	 elukey@cumin1003 wipe-cluster (PID 3700395) is awaiting input
[08:20:46] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[08:22:50] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[08:23:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[08:24:43] <wikibugs>	 (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972)
[08:24:54] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[08:25:56] <wikibugs>	 (PS2) Santiago Faci: Test Kitchen UI: Deploy v1.2.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972)
[08:26:26] <wikibugs>	 (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.2.8 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1268898 (https://phabricator.wikimedia.org/T421972)
[08:28:20] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] CSS: Improve footer on mobile [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1268895 (https://phabricator.wikimedia.org/T422203) (owner: Slyngshede)
[08:30:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: kube-scheduler.service on aux-k8s-ctrl2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:03] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: sync
[08:31:17] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: sync
[08:31:28] <wikibugs>	 (PS1) Ayounsi: Add PTR includes for eqsin routed ganeti ranges [dns] - https://gerrit.wikimedia.org/r/1268899 (https://phabricator.wikimedia.org/T421863)
[08:31:47] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/kafka-mirrormaker: sync
[08:32:09] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: sync
[08:32:34] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: sync
[08:32:42] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: sync
[08:32:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: sync
[08:33:03] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: sync
[08:33:28] <logmsgbot>	 elukey@cumin1003 wipe-cluster (PID 3700395) is awaiting input
[08:34:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:34:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:35:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[08:35:40] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:36:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[08:37:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[08:39:34] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: sync
[08:40:09] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: sync
[08:40:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:41:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:41:36] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster aux-codfw: Kubernetes upgrade
[08:42:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[08:44:25] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Matches what is in Netbox, looks good" [dns] - https://gerrit.wikimedia.org/r/1268899 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[08:44:34] <logmsgbot>	 ayounsi@cumin1003 reimage (PID 3704057) is awaiting input
[08:45:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[08:47:33] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/aux-codfw: maintenance
[08:47:57] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/aux-codfw: maintenance
[08:51:00] <wikibugs>	 (CR) Ayounsi: [C:+2] Add PTR includes for eqsin routed ganeti ranges [dns] - https://gerrit.wikimedia.org/r/1268899 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi)
[08:51:01] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] thanos/store: add a scrape target for the ruler instance [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: Tiziano Fogli)
[08:51:25] <logmsgbot>	 !log ayounsi@dns1004 START - running authdns-update
[08:51:49] <wikibugs>	 (PS1) Elukey: service: allow k8s-ingress-aux-rw to be active/active [puppet] - https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486)
[08:52:46] <logmsgbot>	 !log ayounsi@dns1004 END - running authdns-update
[08:53:03] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie
[08:53:09] <logmsgbot>	 !log taavi@cumin1003 conftool action : set/pooled=yes; selector: name=clouddumps1001.wikimedia.org
[08:54:14] <wikibugs>	 (CR) Majavah: [C:+2] cr-cloud-vrf: Remove clouddumps NAT exemption rule [homer/public] - https://gerrit.wikimedia.org/r/1268516 (owner: Majavah)
[08:56:40] <wikibugs>	 (CR) Elukey: tox: rework venvs to speed up local and CI timings (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[08:56:51] <wikibugs>	 (PS1) Gkyziridis: ml-services: Configure autoscaling for rr-multilingual on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268903 (https://phabricator.wikimedia.org/T415892)
[08:59:48] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Configure autoscaling for rr-multilingual on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268903 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[09:00:01] <taavi>	 !log remove unused cloud-vrf clouddumps cr firewall rule https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1268516
[09:00:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:01] <wikibugs>	 (Merged) jenkins-bot: ml-services: Configure autoscaling for rr-multilingual on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268903 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[09:12:24] <wikibugs>	 SRE, Pywikibot, Traffic, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11798075 (Xqt) >>! In T421642#11797798, @matej_suchanek wrote: >  > Please (re)attach the file, so that it's visible if it's important ([[ https://www.mediawiki.org/wi...
[09:19:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[09:20:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:49] <wikibugs>	 (CR) JMeybohm: "LGTM but could you please add a note about this to https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_control-planes ?" [dns] - https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine)
[09:24:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[09:25:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 17.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:27:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[09:41:35] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie
[09:45:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:47:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[09:50:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[09:54:21] <fabfur>	 !log upgrading haproxy to version 3.2.15 on magru,drmrs,ulsfo (T421402)
[09:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:24] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[09:55:15] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-magru - 3.2.15 upgrade (T421402)
[09:55:19] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-drmrs - 3.2.15 upgrade (T421402)
[09:55:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[09:55:23] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo - 3.2.15 upgrade (T421402)
[09:56:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[09:58:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:-1] "This contradicts both what we do at the edge and our policies - if someone doesn't change the default UA of program they're using, it make" [deployment-charts] - https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) (owner: Daniel Kinzler)
[09:58:55] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1000)
[10:00:05] <jouncebot>	 dues: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:02:28] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1011.eqiad.wmnet with OS bullseye
[10:02:38] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798210 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1011.eq...
[10:02:59] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-fe1011
[10:03:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[10:08:46] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1011 - mvernon@cumin2002"
[10:08:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1011 - mvernon@cumin2002"
[10:08:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:08:52] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-fe1011.eqiad.wmnet 182.32.64.10.in-addr.arpa 2.8.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:08:56] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1011.eqiad.wmnet 182.32.64.10.in-addr.arpa 2.8.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:08:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1011
[10:11:03] <hnowlan>	 There's a pending restbase deploy that I will steal this window for if it's not being used
[10:11:10] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1011
[10:11:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1011
[10:11:19] <hnowlan>	 it'll only affect mathoid so it's very low risk if other things move in parallel 
[10:12:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:12:37] <logmsgbot>	 !log hnowlan@deploy1003 Started deploy [restbase/deploy@dcc15be]: Add urwikisource T415975
[10:12:40] <stashbot>	 T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975
[10:14:08] <logmsgbot>	 !log hnowlan@deploy1003 Finished deploy [restbase/deploy@dcc15be]: Add urwikisource T415975 (duration: 01m 31s)
[10:15:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:18:30] <hnowlan>	 (all done)
[10:25:28] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage
[10:29:04] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage
[10:30:03] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596 (MoritzMuehlenhoff) NEW
[10:30:09] <wikibugs>	 (PS1) Effie Mouzeli: mw-paroid: bump resources and workers [deployment-charts] - https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336)
[10:30:09] <wikibugs>	 SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11798260 (MoritzMuehlenhoff) p:Triage→High
[10:30:20] <wikibugs>	 (CR) Slyngshede: [C:+1] Grant sudo privileges for the analytics-fr-tech-users group [puppet] - https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: Btullis)
[10:35:52] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw-web: downsize for multi-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[10:36:54] <A_smart_kitten>	 ty hnowlan :)
[10:37:55] <wikibugs>	 (CR) Blake: [C:+2] mw-web: downsize for multi-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[10:40:02] <wikibugs>	 (Merged) jenkins-bot: mw-web: downsize for multi-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[10:41:55] <logmsgbot>	 !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:42:01] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1268058 (owner: PipelineBot)
[10:42:20] <logmsgbot>	 !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:42:21] <logmsgbot>	 !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:42:41] <logmsgbot>	 !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:44:00] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1268058 (owner: PipelineBot)
[10:48:46] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie
[10:52:07] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1011.eqiad.wmnet with OS bullseye
[10:52:18] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798395 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1011.eqiad....
[10:52:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[1009-1010,1012-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[10:55:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti5007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:00:04] <jouncebot>	 mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1100).
[11:01:09] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe[1009-1010,1012-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[11:10:48] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:11:03] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798427 (MatthewVernon)
[11:11:05] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:11:14] <moritzm>	 !log installing Tomcat security updates
[11:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:49] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798430 (MatthewVernon) A wrinkle here is that ferm doesn't get reloaded on the other swift nodes (presumably because th...
[11:13:04] <wikibugs>	 (PS1) Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1268921
[11:13:16] <wikibugs>	 (CR) Mvolz: [C:+2] Revert "citoid: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1268921 (owner: Mvolz)
[11:14:30] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Close mailing list editing-team@lists.wikimedia.org - https://phabricator.wikimedia.org/T422562#11798438 (Ladsgroup) Open→Resolved a:Ladsgroup {{done}}
[11:15:39] <wikibugs>	 (Merged) jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - https://gerrit.wikimedia.org/r/1268921 (owner: Mvolz)
[11:15:42] <moritzm>	 !log installing dpkg security updates
[11:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:01] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798465 (taavi) >>! In T421719#11798427, @MatthewVernon wrote: > A wrinkle here is that ferm doesn't get reloaded on the...
[11:23:32] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-ulsfo - 3.2.15 upgrade (T421402)
[11:23:38] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[11:30:24] <wikibugs>	 (PS1) Gkyziridis: ml-services: Deploy autoscaling for rr-mulgilingual on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268923 (https://phabricator.wikimedia.org/T415892)
[11:33:03] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Deploy autoscaling for rr-mulgilingual on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268923 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:35:00] <wikibugs>	 (Merged) jenkins-bot: ml-services: Deploy autoscaling for rr-mulgilingual on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268923 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:35:38] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[11:35:49] <wikibugs>	 (CR) KartikMistry: [C:+2] machinetranslation: Remove networkpolicies for people* [deployment-charts] - https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) (owner: JMeybohm)
[11:37:49] <wikibugs>	 (Merged) jenkins-bot: machinetranslation: Remove networkpolicies for people* [deployment-charts] - https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) (owner: JMeybohm)
[11:38:57] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-magru - 3.2.15 upgrade (T421402)
[11:39:00] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[11:39:26] <kart_>	 Deploying MinT; Minor changes.
[11:41:23] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[11:41:30] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[11:42:14] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-drmrs - 3.2.15 upgrade (T421402)
[11:42:44] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[11:42:52] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[11:43:33] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[11:43:38] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[11:44:28] <kart_>	 !log machinetranslation: Remove networkpolicies for people* (T335491)
[11:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:31] <stashbot>	 T335491: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491
[11:45:25] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:46:48] <wikibugs>	 (PS1) Gkyziridis: ml-services: Deploy rr-multilingual model on experimental with autoscaling enabled. [deployment-charts] - https://gerrit.wikimedia.org/r/1268927 (https://phabricator.wikimedia.org/T415892)
[11:48:51] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Deploy rr-multilingual model on experimental with autoscaling enabled. [deployment-charts] - https://gerrit.wikimedia.org/r/1268927 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:50:25] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:15] <wikibugs>	 (Merged) jenkins-bot: ml-services: Deploy rr-multilingual model on experimental with autoscaling enabled. [deployment-charts] - https://gerrit.wikimedia.org/r/1268927 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:53:06] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:07:28] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp1001.eqiad.wmnet
[12:09:23] <wikibugs>	 (PS1) Gkyziridis: fix empty line [deployment-charts] - https://gerrit.wikimedia.org/r/1268933
[12:10:50] <wikibugs>	 (PS1) Jelto: gitlab: do not send gitlab logs to journal/syslog [puppet] - https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589)
[12:13:07] <wikibugs>	 (PS2) Arnaudb: gerrit: shorten Envoy upstream idle timeout to 100s [puppet] - https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827)
[12:13:07] <wikibugs>	 (CR) Arnaudb: "disabling connection reuse at CDN level did not fix `GnuTLS recv error (-54)` happening in CI." [puppet] - https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb)
[12:13:16] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8390/co" [puppet] - https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[12:13:34] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp1001.eqiad.wmnet
[12:14:06] <wikibugs>	 (PS3) Arnaudb: gerrit: shorten Envoy upstream idle timeout to 100s [puppet] - https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827)
[12:14:14] <wikibugs>	 (PS2) Effie Mouzeli: mw-paroid: bump resources and workers [deployment-charts] - https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336)
[12:15:16] <logmsgbot>	 !log mszwarc@deploy1003 mwscript-k8s job started: foreachwikiindblist all backfillInterwikiRightsLog.php --remote-wiki metawiki 20260311190000  # T6055
[12:15:19] <stashbot>	 T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055
[12:15:27] <wikibugs>	 (PS2) Jelto: gitlab: do not send gitlab logs to journal/syslog [puppet] - https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589)
[12:16:48] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8391/co" [puppet] - https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[12:18:35] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8392/co" [puppet] - https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[12:19:12] <wikibugs>	 (CR) Gkyziridis: [C:+2] fix empty line [deployment-charts] - https://gerrit.wikimedia.org/r/1268933 (owner: Gkyziridis)
[12:19:49] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw-paroid: bump resources and workers [deployment-charts] - https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) (owner: Effie Mouzeli)
[12:21:23] <wikibugs>	 (Merged) jenkins-bot: fix empty line [deployment-charts] - https://gerrit.wikimedia.org/r/1268933 (owner: Gkyziridis)
[12:23:49] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: shorten Envoy upstream idle timeout to 100s [puppet] - https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb)
[12:24:14] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] mw-paroid: bump resources and workers [deployment-charts] - https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) (owner: Effie Mouzeli)
[12:26:15] <wikibugs>	 (Merged) jenkins-bot: mw-paroid: bump resources and workers [deployment-charts] - https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) (owner: Effie Mouzeli)
[12:26:43] <wikibugs>	 (PS1) Gkyziridis: ml-services: Deploy both revertrisk models on experimental. [deployment-charts] - https://gerrit.wikimedia.org/r/1268936
[12:27:00] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp2001.codfw.wmnet
[12:27:32] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[12:27:52] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp2001.codfw.wmnet
[12:28:04] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[12:28:37] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[12:29:17] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[12:29:35] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Deploy both revertrisk models on experimental. [deployment-charts] - https://gerrit.wikimedia.org/r/1268936 (owner: Gkyziridis)
[12:31:28] <wikibugs>	 (Merged) jenkins-bot: ml-services: Deploy both revertrisk models on experimental. [deployment-charts] - https://gerrit.wikimedia.org/r/1268936 (owner: Gkyziridis)
[12:31:59] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[12:32:06] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[12:32:12] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[12:32:16] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[12:32:39] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:34:07] <wikibugs>	 SRE, Infrastructure-Foundations: Update debdeploy to use checkrestart instead of lsof to detect library restarts - https://phabricator.wikimedia.org/T422614 (MoritzMuehlenhoff) NEW
[12:34:10] <wikibugs>	 (PS1) Volans: debdeploy: use cumin v6.0.0 new APIs [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268937
[12:38:39] <wikibugs>	 (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[12:40:11] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Move dumps to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[12:40:17] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie
[12:41:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:43:12] <taavi>	 !log restarting pybal on lvs1020
[12:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:25] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:20] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:50:35] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 18 connections established with conf1007.eqiad.wmnet:4001 (min=22) https://wikitech.wikimedia.org/wiki/PyBal
[12:52:06] <wikibugs>	 (CR) Elukey: [C:+1] debdeploy: use cumin v6.0.0 new APIs [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268937 (owner: Volans)
[12:53:37] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:53:58] <wikibugs>	 (PS1) Majavah: P:dumps: rsync: Do not use LOAD_BALANCER_HEALTH_CHECKS [puppet] - https://gerrit.wikimedia.org/r/1268942
[12:54:52] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1018 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Majavah adding dumps-lb - The acknowledgement expires at: 2026-04-09 14:54:38. https://wikitech.wikimedia.org/wiki/PyBal
[12:54:52] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 18 connections established with conf1007.eqiad.wmnet:4001 (min=22) Majavah adding dumps-lb - The acknowledgement expires at: 2026-04-09 14:54:38. https://wikitech.wikimedia.org/wiki/PyBal
[12:54:55] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: Anzx)
[12:55:24] <Lucas_WMDE>	 cscott: question about the change you scheduled for deployment, just out of interest – is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1268679 not blocked on the same issue as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1268680 ?
[12:56:05] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Majavah dumps-lb - The acknowledgement expires at: 2026-04-09 13:55:54. https://wikitech.wikimedia.org/wiki/PyBal
[12:56:05] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dumps-lb_873: Servers clouddumps1001.wikimedia.org are marked down but pooled: dumps-lb6_873: Servers clouddumps1001.wikimedia.org are marked down but pooled Majavah dumps-lb - The acknowledgement expires at: 2026-04-09 13:55:54. https://wikitech.wikimedia.org/wiki/PyBal
[12:56:19] <wikibugs>	 (CR) Majavah: [C:+2] P:dumps: rsync: Do not use LOAD_BALANCER_HEALTH_CHECKS [puppet] - https://gerrit.wikimedia.org/r/1268942 (owner: Majavah)
[12:56:51] <cscott>	 Lucas_WMDE: eswiki doesn't use flagged revs to the same degree as dewiki, as far as I know
[12:57:10] <Lucas_WMDE>	 ah, I missed the FlaggedRevs connection
[12:57:11] <Lucas_WMDE>	 thanks!
[12:58:07] <cscott>	 Also my fault for not clarifying that there were two parts to that bug: flagged revs and an issue with oldids and the revision cache, and I backported the revision cache fix to WMF.22 yesterday
[12:59:18] <cscott>	 The flagged revs fix is in wmf.23 but it was a little too complicated for a backport, and for cache reasons we'd actually like to do the deploy in 2 steps anyway
[13:00:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:01:17] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268937 (owner: Volans)
[13:01:59] <wikibugs>	 (CR) Volans: [V:+2 C:+2] debdeploy: use cumin v6.0.0 new APIs [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268937 (owner: Volans)
[13:02:54] <anzx>	 jouncebot: now
[13:02:54] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1300)
[13:03:18] <cscott>	 Here
[13:03:22] <cscott>	 I can spiderpig
[13:03:27] <anzx>	 o/
[13:04:24] <cscott>	 anzx: do you want to go first?
[13:04:51] <taavi>	 !log restarting pybal on lvs1018
[13:04:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:19] <anzx>	 cscott: need someone to deploy mine, please go ahead  
[13:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:35] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 22 connections established with conf1007.eqiad.wmnet:4001 (min=22) https://wikitech.wikimedia.org/wiki/PyBal
[13:06:15] <cscott>	 Ok
[13:07:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268679 (https://phabricator.wikimedia.org/T422524) (owner: C. Scott Ananian)
[13:07:43] <Lucas_WMDE>	 o/
[13:08:00] <Lucas_WMDE>	 sorry, got so distracted that I missed the start of the actual window
[13:08:06] <Lucas_WMDE>	 I can deploy for anzx once cscott is done :)
[13:08:07] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid Read Views for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1268679 (https://phabricator.wikimedia.org/T422524) (owner: C. Scott Ananian)
[13:08:36] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1268679|Turn on Parsoid Read Views for eswiki (T422524)]]
[13:08:37] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:08:39] <stashbot>	 T422524: Parsoid Read Views to deploy ~2026-04-07 - https://phabricator.wikimedia.org/T422524
[13:09:28] <wikibugs>	 (PS1) Jelto: gitlab: add feature flag for rsyslog input and disable in devtools [puppet] - https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589)
[13:09:35] <phuedx>	 o/ I have a patch that needs backporting and deploying
[13:09:41] <phuedx>	 I'll get the backports ready
[13:09:57] <wikibugs>	 (PS1) Phuedx: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112)
[13:10:19] <wikibugs>	 (PS1) Phuedx: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112)
[13:10:34] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1268679|Turn on Parsoid Read Views for eswiki (T422524)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:10:41] <wikibugs>	 (PS5) Majavah: hieradata: Move dumps to production [puppet] - https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040)
[13:11:47] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with sync
[13:12:04] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8393/console" [puppet] - https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[13:14:23] <wikibugs>	 (CR) Fabfur: [C:+1] hieradata: Move dumps to production [puppet] - https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[13:14:44] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Move dumps to production [puppet] - https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[13:15:42] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268679|Turn on Parsoid Read Views for eswiki (T422524)]] (duration: 07m 06s)
[13:15:46] <stashbot>	 T422524: Parsoid Read Views to deploy ~2026-04-07 - https://phabricator.wikimedia.org/T422524
[13:15:59] <cscott>	 ok, all done.
[13:16:05] <Lucas_WMDE>	 thanks! I’ll continue
[13:16:08] <cscott>	 over to you, Lucas_WMDE and anzx 
[13:16:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: Anzx)
[13:17:58] <Lucas_WMDE>	 hmph, zuul/CI is quite busy
[13:18:05] <Lucas_WMDE>	 hasn’t even started the gate-and-submit jobs yet
[13:18:57] <wikibugs>	 (CR) CI reject: [V:-1] PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:19:24] <taavi>	 someone seems to just have pushed a patch with a rather long dependency chain
[13:20:48] <cscott>	 spurious failures aren't helping, since that's causing jenkins to invalidate all its work on the gate-and-submit pipeline and start over
[13:20:51] <wikibugs>	 (CR) Phuedx: "Recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:22:02] <wikibugs>	 (Merged) jenkins-bot: cswiki: lift IP cap for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: Anzx)
[13:22:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1268820|cswiki: lift IP cap for workshop (T422520)]]
[13:22:27] <stashbot>	 T422520: Lift IP cap on 2026-04-13 for Students Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T422520
[13:24:11] <wikibugs>	 (CR) Arnaudb: [C:+1] "small nit inline, lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[13:24:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1268820|cswiki: lift IP cap for workshop (T422520)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:24:33] <Lucas_WMDE>	 anzx: anything you want to test on mwdebug?
[13:24:46] <anzx>	 nothing to test
[13:25:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync
[13:25:04] <Lucas_WMDE>	 sounds good
[13:25:35] <Lucas_WMDE>	 phuedx: are you adding your backports to the deployment calendar btw?
[13:25:41] <phuedx>	 Will do
[13:25:54] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:26:01] <Lucas_WMDE>	 thanks :)
[13:26:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:26:17] <Lucas_WMDE>	 do you want to deploy them yourself (once the current scap is done) or shall I?
[13:26:27] <moritzm>	 !log upgrade debdeploy-server on cumin2002 to 0.0.99.14-1+deb12u1+exp1 (temporary build with Cumin 6 compat before we have Cumin 6 universally)
[13:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268820|cswiki: lift IP cap for workshop (T422520)]] (duration: 06m 22s)
[13:28:50] <stashbot>	 T422520: Lift IP cap on 2026-04-13 for Students Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T422520
[13:28:59] <Lucas_WMDE>	 phuedx: over to you!
[13:29:11] <wikibugs>	 (PS1) Atsuko: airflow: dag filter helper function [deployment-charts] - https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730)
[13:30:07] <wikibugs>	 (PS1) Majavah: dumps: web: Add header for host that served the request [puppet] - https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040)
[13:30:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:30:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:30:41] <Lucas_WMDE>	 or would you like me to deploy?
[13:30:42] <Lucas_WMDE>	 ah ok
[13:31:47] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie
[13:32:14] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8395/co" [puppet] - https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[13:32:21] <phuedx>	 Thanks Lucas_WMDE
[13:32:51] <wikibugs>	 (Merged) jenkins-bot: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:32:53] <wikibugs>	 (Merged) jenkins-bot: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: Phuedx)
[13:33:22] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1268948|PHP SDK: Measure known experiments correctly (T422112)]], [[gerrit:1268947|PHP SDK: Measure known experiments correctly (T422112)]]
[13:33:25] <stashbot>	 T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112
[13:33:43] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-scholarly,name=eqiad
[13:34:17] <wikibugs>	 (PS1) Majavah: wikimedia.org: Send dumps-rsync to LVS service [dns] - https://gerrit.wikimedia.org/r/1268954 (https://phabricator.wikimedia.org/T422040)
[13:34:19] <wikibugs>	 (PS1) Majavah: wikimedia.org: Send dumps to LVS service [dns] - https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040)
[13:35:16] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1268948|PHP SDK: Measure known experiments correctly (T422112)]], [[gerrit:1268947|PHP SDK: Measure known experiments correctly (T422112)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:36:26] <wikibugs>	 (CR) Atsuko: [C:-1] "this is a preliminary diff, no review needed" [deployment-charts] - https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: Atsuko)
[13:36:37] <phuedx>	 Looking now
[13:37:14] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11799464 (MoritzMuehlenhoff)
[13:37:30] <phuedx>	 I did a quick browse of a couple of wikis and saw no errors/warnings in the logs. Continuing
[13:37:36] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with sync
[13:41:21] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268948|PHP SDK: Measure known experiments correctly (T422112)]], [[gerrit:1268947|PHP SDK: Measure known experiments correctly (T422112)]] (duration: 07m 58s)
[13:41:24] <stashbot>	 T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112
[13:42:03] <wikibugs>	 (PS1) Gkyziridis: ml-services: Deploy rr-multilingual and langugage-agnostic in experimental staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268963 (https://phabricator.wikimedia.org/T415892)
[13:42:11] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Deploy rr-multilingual and langugage-agnostic in experimental staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268963 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[13:42:51] <Lucas_WMDE>	 I think that’s it!
[13:42:56] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:02] * phuedx watches the logs
[13:43:21] <phuedx>	 Looking good at the moment
[13:44:16] <wikibugs>	 (Merged) jenkins-bot: ml-services: Deploy rr-multilingual and langugage-agnostic in experimental staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1268963 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[13:45:39] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:46:22] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] "Gotcha — we're talking about 11 metrics (6 envoy_cluster_update + 5 envoy_dns) across ~1800 scraping-job targets, resulting in roughly 20k" [puppet] - https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: JMeybohm)
[13:51:09] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: add feature flag for rsyslog input and disable in devtools [puppet] - https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[13:51:26] <wikibugs>	 (PS1) Volans: debdeploy: fix typo in printed message [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268964
[13:52:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268964 (owner: Volans)
[13:52:31] <wikibugs>	 (CR) Volans: [V:+2 C:+2] debdeploy: fix typo in printed message [debs/debdeploy] - https://gerrit.wikimedia.org/r/1268964 (owner: Volans)
[13:53:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Recommendation-API: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11799697 (DPogorzelski-WMF) a:Jclark-ctr→klausman
[13:55:41] <wikibugs>	 (PS1) Sergio Gimeno: EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001)
[13:59:38] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2026-04-01-092119 to 2026-04-06-224243 [deployment-charts] - https://gerrit.wikimedia.org/r/1268966 (https://phabricator.wikimedia.org/T421815)
[13:59:40] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-31-162258 to 2026-04-07-234729 [deployment-charts] - https://gerrit.wikimedia.org/r/1268967 (https://phabricator.wikimedia.org/T407903)
[14:00:43] <wikibugs>	 (CR) Phuedx: [C:+1] EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: Sergio Gimeno)
[14:02:08] <wikibugs>	 (PS1) Brouberol: deployment_server: remove un-used opensearch-test-codfw kubeconfig [puppet] - https://gerrit.wikimedia.org/r/1268968
[14:02:50] <wikibugs>	 (PS1) Clément Goubert: data.yaml: cgoubert: Replace non-FIDO key with backup [puppet] - https://gerrit.wikimedia.org/r/1268970
[14:03:17] <wikibugs>	 (PS1) Atsuko: atsuko: backup Yubikey and krb [puppet] - https://gerrit.wikimedia.org/r/1268972
[14:03:19] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[14:04:10] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[14:05:33] <wikibugs>	 (CR) Ecarg: [C:+2] wikifunctions: Upgrade evaluators from 2026-04-01-092119 to 2026-04-06-224243 [deployment-charts] - https://gerrit.wikimedia.org/r/1268966 (https://phabricator.wikimedia.org/T421815) (owner: Jforrester)
[14:07:05] <wikibugs>	 (CR) Bking: [C:+1] deployment_server: remove un-used opensearch-test-codfw kubeconfig [puppet] - https://gerrit.wikimedia.org/r/1268968 (owner: Brouberol)
[14:07:24] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:07:36] <wikibugs>	 (CR) Brouberol: [C:+1] "Yubikey pubkey validated out of band, and +1 on the kerberos addition" [puppet] - https://gerrit.wikimedia.org/r/1268972 (owner: Atsuko)
[14:07:38] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:07:47] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-01-092119 to 2026-04-06-224243 [deployment-charts] - https://gerrit.wikimedia.org/r/1268966 (https://phabricator.wikimedia.org/T421815) (owner: Jforrester)
[14:08:30] <logmsgbot>	 !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:09:05] <logmsgbot>	 !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:10:17] <wikibugs>	 (CR) Atsuko: [C:+2] atsuko: backup Yubikey and krb [puppet] - https://gerrit.wikimedia.org/r/1268972 (owner: Atsuko)
[14:10:19] <logmsgbot>	 !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:10:39] <wikibugs>	 (CR) Atsuko: [C:+2] "merging" [puppet] - https://gerrit.wikimedia.org/r/1268972 (owner: Atsuko)
[14:11:04] <logmsgbot>	 !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:11:14] <logmsgbot>	 !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:11:54] <logmsgbot>	 !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:13:35] <wikibugs>	 (CR) Ecarg: [C:+2] wikifunctions: Upgrade orchestrator from 2026-03-31-162258 to 2026-04-07-234729 [deployment-charts] - https://gerrit.wikimedia.org/r/1268967 (https://phabricator.wikimedia.org/T407903) (owner: Jforrester)
[14:15:55] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-31-162258 to 2026-04-07-234729 [deployment-charts] - https://gerrit.wikimedia.org/r/1268967 (https://phabricator.wikimedia.org/T407903) (owner: Jforrester)
[14:16:32] <logmsgbot>	 !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:17:13] <logmsgbot>	 !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:17:58] <logmsgbot>	 !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:18:34] <logmsgbot>	 !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:18:45] <logmsgbot>	 !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:19:16] <logmsgbot>	 !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:19:51] <wikibugs>	 (PS2) Majavah: dumps: web: Add header for host that served the request [puppet] - https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040)
[14:19:51] <wikibugs>	 (PS1) Majavah: hieradata: Fix dumps http probe [puppet] - https://gerrit.wikimedia.org/r/1268978
[14:19:51] <wikibugs>	 (PS1) Majavah: hieradata: Enable paging for dumps services [puppet] - https://gerrit.wikimedia.org/r/1268979
[14:20:17] <wikibugs>	 (PS2) Majavah: hieradata: Fix dumps http probe [puppet] - https://gerrit.wikimedia.org/r/1268978 (https://phabricator.wikimedia.org/T422040)
[14:20:19] <wikibugs>	 (PS3) Majavah: dumps: web: Add header for host that served the request [puppet] - https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040)
[14:20:19] <wikibugs>	 (PS2) Majavah: hieradata: Enable paging for dumps services [puppet] - https://gerrit.wikimedia.org/r/1268979
[14:22:01] <wikibugs>	 (CR) Brouberol: [C:+2] deployment_server: remove un-used opensearch-test-codfw kubeconfig [puppet] - https://gerrit.wikimedia.org/r/1268968 (owner: Brouberol)
[14:24:06] <wikibugs>	 (PS1) Majavah: Revert "P:toolforge::prometheus: Disable istio-gateway scrape for now" [puppet] - https://gerrit.wikimedia.org/r/1268981 (https://phabricator.wikimedia.org/T421386)
[14:25:25] <jinxer-wm>	 RESOLVED: ProbeDown: Service aqs1023-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1023-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1400)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1430)
[14:31:24] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] wikimedia.org: Send dumps-rsync to LVS service [dns] - https://gerrit.wikimedia.org/r/1268954 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:31:34] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] wikimedia.org: Send dumps to LVS service [dns] - https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:31:47] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] dumps: web: Add header for host that served the request [puppet] - https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:31:54] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: upgrade haproxy to version 3.2 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[14:32:09] <fabfur>	 !log upgrading eqsin to haproxy 3.2 (T421402) 
[14:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:12] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[14:33:49] <wikibugs>	 (CR) Scott French: [C:+2] wikikube: Temporarily double coredns replicas (12) [deployment-charts] - https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) (owner: Scott French)
[14:34:45] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] hieradata: Fix dumps http probe [puppet] - https://gerrit.wikimedia.org/r/1268978 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:35:48] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Fix dumps http probe [puppet] - https://gerrit.wikimedia.org/r/1268978 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:35:59] <wikibugs>	 (CR) Majavah: [C:+2] dumps: web: Add header for host that served the request [puppet] - https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:36:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org
[14:37:10] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin - 3.2 upgrade (T421402)
[14:37:13] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[14:37:25] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin - 3.2 upgrade (T421402)
[14:38:31] <wikibugs>	 (PS1) Majavah: dumps: web: Remove plaintext HTTP server [puppet] - https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672)
[14:39:35] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8396/co" [puppet] - https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) (owner: Majavah)
[14:39:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org
[14:40:25] <wikibugs>	 (PS3) Fabfur: hiera: upgrade haproxy to version 3.2 on codfw [puppet] - https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402)
[14:40:30] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[14:40:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:40:55] <wikibugs>	 (CR) Majavah: [C:+2] wikimedia.org: Send dumps-rsync to LVS service [dns] - https://gerrit.wikimedia.org/r/1268954 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[14:41:08] <logmsgbot>	 !log taavi@dns1004 START - running authdns-update
[14:41:31] <wikibugs>	 (Merged) jenkins-bot: wikikube: Temporarily double coredns replicas (12) [deployment-charts] - https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) (owner: Scott French)
[14:41:44] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1268970 (owner: Clément Goubert)
[14:42:27] <logmsgbot>	 !log taavi@dns1004 END - running authdns-update
[14:42:29] <wikibugs>	 (CR) Clément Goubert: [C:+2] data.yaml: cgoubert: Replace non-FIDO key with backup [puppet] - https://gerrit.wikimedia.org/r/1268970 (owner: Clément Goubert)
[14:46:20] <wikibugs>	 (CR) Elukey: "Tested in https://phabricator.wikimedia.org/T420993#11799738" [puppet] - https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[14:47:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:48:56] <taavi>	 !log serve dumps rsync traffic via new LVS service T422040
[14:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:58] <stashbot>	 T422040: Migrate clouddumps https/rsync interfaces behind LVS - https://phabricator.wikimedia.org/T422040
[14:49:43] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:53:30] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:56:16] <wikibugs>	 (PS1) Elukey: ipmi: allow to run commands as another user [software/spicerack] - https://gerrit.wikimedia.org/r/1268997 (https://phabricator.wikimedia.org/T418929)
[14:57:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:58:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:58:43] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:00:03] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/1268997 (https://phabricator.wikimedia.org/T418929) (owner: Elukey)
[15:00:47] <logmsgbot>	 !log derick@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=zhwiki --logwiki=metawiki 'Mr Kazi Tuhin' KaziHasanTuhin  # T422677
[15:00:50] <stashbot>	 T422677: Unblock stuck global rename of KaziHasanTuhin - https://phabricator.wikimedia.org/T422677
[15:05:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1081.eqiad.wmnet with OS bullseye
[15:05:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1081
[15:06:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:07:57] <wikibugs>	 (CR) Krinkle: [C:+1] Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[15:10:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1081 - bking@cumin2002"
[15:10:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1081 - bking@cumin2002"
[15:10:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:10:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1081.eqiad.wmnet 166.32.64.10.in-addr.arpa 6.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:10:16] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1081.eqiad.wmnet 166.32.64.10.in-addr.arpa 6.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:10:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1081
[15:11:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1081
[15:11:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1081
[15:13:03] <wikibugs>	 (PS1) Hnowlan: admin: add backup yubikey for hnowlan [puppet] - https://gerrit.wikimedia.org/r/1268999
[15:14:23] <wikibugs>	 (PS2) Hnowlan: admin: add backup yubikey for hnowlan, remove legacy key [puppet] - https://gerrit.wikimedia.org/r/1268999
[15:14:40] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Indeed, and that also happens fairly often any (e.g. when one of the base sets get modified, which happens often)." [puppet] - https://gerrit.wikimedia.org/r/1261497 (owner: JHathaway)
[15:16:23] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:16:27] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:16:30] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcephmon2004-dev.codfw.wmnet
[15:16:32] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:16:58] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:17:00] <wikibugs>	 (CR) Majavah: [C:+1] nftables: cleanup tests [puppet] - https://gerrit.wikimedia.org/r/1261497 (owner: JHathaway)
[15:17:03] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:17:09] <sukhe>	 !log sukhe@lvs1020:~$ sudo systemctl restart pybal.service 
[15:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:22] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:17:29] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:17:58] <wikibugs>	 (CR) Bking: [C:+2] bking: add some helpers to dotfiles [puppet] - https://gerrit.wikimedia.org/r/1268672 (owner: Bking)
[15:17:59] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:18:04] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:18:41] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:18:49] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:19:03] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:19:10] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:19:17] <wikibugs>	 (CR) Majavah: "most of the failures listed on https://puppet-compiler.wmflabs.org/output/1211651/6231/, so https://puppet-compiler.wmflabs.org/output/121" [puppet] - https://gerrit.wikimedia.org/r/1266205 (owner: Majavah)
[15:19:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1087.eqiad.wmnet with OS bullseye
[15:20:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1087
[15:20:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:26:03] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.dns.netbox
[15:26:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1087 - bking@cumin2002"
[15:27:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1081.eqiad.wmnet with reason: host reimage
[15:27:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1087 - bking@cumin2002"
[15:27:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:27:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1087.eqiad.wmnet 174.32.64.10.in-addr.arpa 4.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:27:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1087.eqiad.wmnet 174.32.64.10.in-addr.arpa 4.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:27:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1087
[15:28:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1087
[15:28:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1087
[15:28:13] <wikibugs>	 (CR) CDobbins: "`" [dns] - https://gerrit.wikimedia.org/r/1267042 (owner: Ayounsi)
[15:28:29] <wikibugs>	 (CR) Elukey: [C:+2] ipmi: allow to run commands as another user [software/spicerack] - https://gerrit.wikimedia.org/r/1268997 (https://phabricator.wikimedia.org/T418929) (owner: Elukey)
[15:28:46] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:28:49] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon2004-dev.codfw.wmnet
[15:29:18] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin - 3.2 upgrade (T421402)
[15:29:21] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:30:57] <wikibugs>	 (CR) Cathal Mooney: "Agreed I don't think based on the above or the data Chris shared we can justify sending all of RE to drmrs.  If it had been close perhaps " [dns] - https://gerrit.wikimedia.org/r/1267042 (owner: Ayounsi)
[15:32:13] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission cloudcephmon2004-dev - https://phabricator.wikimedia.org/T422437#11800245 (Andrew) a:Andrew→None
[15:35:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1081.eqiad.wmnet with reason: host reimage
[15:35:30] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1268999 (owner: Hnowlan)
[15:36:45] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin - 3.2 upgrade (T421402)
[15:36:48] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:39:06] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: upgrade haproxy to version 3.2 on codfw [puppet] - https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[15:39:13] <wikibugs>	 (CR) Hnowlan: [C:+2] admin: add backup yubikey for hnowlan, remove legacy key [puppet] - https://gerrit.wikimedia.org/r/1268999 (owner: Hnowlan)
[15:39:41] <hnowlan>	 fabfur: think I caught your change, okay to merge I assume? 
[15:39:53] <fabfur>	 yep thanks
[15:40:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:41:51] <fabfur>	 !log upgrading codfw to haproxy 3.2 (T421402) 
[15:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:58] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:42:06] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw - 3.2 upgrade (T421402)
[15:42:12] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw - 3.2 upgrade (T421402)
[15:43:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1087.eqiad.wmnet with reason: host reimage
[15:48:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[15:49:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1087.eqiad.wmnet with reason: host reimage
[15:52:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS trixie
[15:52:28] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1023.eqiad.wmnet
[15:52:29] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1023.eqiad.wmnet
[15:55:57] <wikibugs>	 (PS1) Elukey: sre.network: add workaround for dry-run in run_junos_commands [cookbooks] - https://gerrit.wikimedia.org/r/1269011
[16:00:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1081.eqiad.wmnet with OS bullseye
[16:04:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11800391 (jcrespo) Let me know when you can.
[16:07:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1087.eqiad.wmnet with OS bullseye
[16:09:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:12:21] <wikibugs>	 (PS49) CDobbins: (traffic): add alert for depooled cp* hosts [alerts] - https://gerrit.wikimedia.org/r/1217262
[16:12:30] <wikibugs>	 SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org, Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11800430 (Andrew) The only thing left to do here (that I know if) is relative links being messed up in the initial wikitech-static landing pag...
[16:14:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11800433 (cmooney) Ok Juniper came back with the following: ` I found that your version 23.4R2-S7.4 is hitting the PR1933049. Unfortunately, this is a confidential PR, but in order to get thi...
[16:15:17] <wikibugs>	 (CR) Volans: sre.network: add workaround for dry-run in run_junos_commands (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1269011 (owner: Elukey)
[16:19:14] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw - 3.2 upgrade (T421402)
[16:19:17] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[16:20:27] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw - 3.2 upgrade (T421402)
[16:20:44] <wikibugs>	 (PS6) Eevans: aqs1024: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830)
[16:20:44] <wikibugs>	 (PS7) Eevans: aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830)
[16:20:44] <wikibugs>	 (PS7) Eevans: aqs1026: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830)
[16:20:45] <wikibugs>	 (PS7) Eevans: aqs1027: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830)
[16:22:23] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[16:34:15] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:38:45] <wikibugs>	 (PS7) Eevans: aqs1024: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830)
[16:38:46] <wikibugs>	 (PS8) Eevans: aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830)
[16:38:46] <wikibugs>	 (PS8) Eevans: aqs1026: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830)
[16:38:46] <wikibugs>	 (PS8) Eevans: aqs1027: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830)
[16:39:03] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[16:41:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:42:36] <wikibugs>	 (PS2) Majavah: dumps: web: Remove plaintext HTTP server [puppet] - https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672)
[16:42:36] <wikibugs>	 (PS1) Majavah: dumps: web: Use 429 for connection limit issues [puppet] - https://gerrit.wikimedia.org/r/1269021
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1700)
[17:00:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1088.eqiad.wmnet with OS bullseye
[17:01:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1088
[17:02:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:06:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1088 - bking@cumin2002"
[17:06:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1088 - bking@cumin2002"
[17:06:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:06:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1088.eqiad.wmnet 176.32.64.10.in-addr.arpa 6.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:06:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1088.eqiad.wmnet 176.32.64.10.in-addr.arpa 6.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:06:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1088
[17:07:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1088
[17:07:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1088
[17:08:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2002.codfw.wmnet with OS trixie
[17:17:12] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1269024
[17:17:23] <wikibugs>	 (CR) CI reject: [V:-1] rest gateway: avoid re-defining routes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1269024 (owner: Daniel Kinzler)
[17:19:04] <wikibugs>	 (PS2) Daniel Kinzler: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581)
[17:19:06] <wikibugs>	 (CR) Daniel Kinzler: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: Daniel Kinzler)
[17:19:20] <wikibugs>	 (PS3) Daniel Kinzler: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581)
[17:23:30] <wikibugs>	 (PS2) Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1269024
[17:23:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1088.eqiad.wmnet with reason: host reimage
[17:27:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11801010 (Papaul) @jcrespo we can do this  next week Wednesday April 15th at 10am CT . Thank you.
[17:29:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1088.eqiad.wmnet with reason: host reimage
[17:35:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1089.eqiad.wmnet with OS bullseye
[17:36:09] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1089.eqiad.wmnet with OS bullseye
[17:36:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye
[17:37:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103
[17:38:30] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[17:39:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:40:11] <wikibugs>	 ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11801044 (Jhancock.wm) a:Jhancock.wm good news everybody!  there was definitely a power surge on this server. I've been replacing piec...
[17:41:09] <wikibugs>	 ops-codfw, SRE, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420708#11801062 (Jhancock.wm) Open→Resolved a:Jhancock.wm finally replaced all the parts that got fried in a power surge. powered up and back in the rack.
[17:42:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1103 - bking@cumin2002"
[17:42:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1103 - bking@cumin2002"
[17:42:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1103.eqiad.wmnet 43.48.64.10.in-addr.arpa 3.4.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:42:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1103.eqiad.wmnet 43.48.64.10.in-addr.arpa 3.4.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:42:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1103
[17:43:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1103
[17:43:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103
[17:45:20] <wikibugs>	 ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11801099 (jcrespo) Open→Resolved @Jhancock.wm I want to thank you deeply the work, a lot! Please note your work will pay off,...
[17:49:01] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:49:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1088.eqiad.wmnet with OS bullseye
[18:00:05] <jouncebot>	 dancy and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1800).
[18:00:10] <dancy>	 o/
[18:00:16] <dancy>	 I'm here to press buttons
[18:00:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage
[18:01:42] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1269032 (https://phabricator.wikimedia.org/T420481)
[18:01:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by dancy@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269032 (https://phabricator.wikimedia.org/T420481) (owner: TrainBranchBot)
[18:02:45] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1269032 (https://phabricator.wikimedia.org/T420481) (owner: TrainBranchBot)
[18:04:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage
[18:08:26] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.23  refs T420481
[18:08:30] <stashbot>	 T420481: 1.46.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T420481
[18:16:27] <wikibugs>	 (CR) Eevans: [C:+2] aqs1024: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[18:24:03] <wikibugs>	 (PS1) Jdrewniak: Disable extension:WP25EasterEggs from Wikipedias. [mediawiki-config] - https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548)
[18:25:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1103.eqiad.wmnet with OS bullseye
[18:32:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye
[18:33:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103
[18:33:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103
[18:33:28] <wikibugs>	 (PS1) Jforrester: mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299)
[18:45:27] <wikibugs>	 (PS2) Jforrester: mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299)
[18:46:42] <wikibugs>	 (CR) RLazarus: [C:+1] mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) (owner: Jforrester)
[18:48:38] <James_F>	 dancy: Is it possible for me to sneak out an MW chart fix now?
[18:49:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage
[18:49:53] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 3714.54 ms
[18:50:25] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:50:27] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[18:50:42] <dancy>	 James_F: Yep!
[18:50:47] <James_F>	 Thanks.
[18:50:52] <wikibugs>	 (CR) Jforrester: [C:+2] mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) (owner: Jforrester)
[18:52:56] <wikibugs>	 (Merged) jenkins-bot: mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) (owner: Jforrester)
[18:53:47] <wikibugs>	 (CR) Stoyofuku-wmf: [C:+1] "😢 end of an era" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) (owner: Jdrewniak)
[18:54:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage
[18:55:05] <wikibugs>	 (PS1) Andrew Bogott: Add key for my new (and less destroyed) yubikey [puppet] - https://gerrit.wikimedia.org/r/1269042
[18:55:25] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:55:54] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply
[18:56:15] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Add key for my new (and less destroyed) yubikey [puppet] - https://gerrit.wikimedia.org/r/1269042 (owner: Andrew Bogott)
[18:56:49] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply
[18:57:25] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
[19:01:32] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1024.eqiad.wmnet with reason: Bootstrapping — T412830
[19:01:35] <stashbot>	 T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830
[19:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[19:02:34] <wikibugs>	 (PS2) Dzahn: zuul::base: ensure /var/ssh/zuul exists [puppet] - https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938)
[19:02:55] <rzl>	 ^ looking, I think this is just due to James_F's rollout in progress (i.e. due to the rollout itself, not a problem with the new config) but double-checking
[19:04:10] <rzl>	 yep, all looks fine except for the churn, that alert will clear on its own when the deployment finishes
[19:09:59] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
[19:10:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:11:13] <rzl>	 ^ also expected
[19:11:36] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Phabricator: Replace Exim on phabricator servers with Postfix - https://phabricator.wikimedia.org/T378029#11801471 (A_smart_kitten)
[19:12:39] <wikibugs>	 (PS3) Dzahn: zuul::base: ensure /var/ssh/zuul exists [puppet] - https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938)
[19:13:27] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, Mail, Phabricator: Replace Exim on phabricator servers with Postfix - https://phabricator.wikimedia.org/T378029#11801490 (Dzahn)
[19:14:38] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1103.eqiad.wmnet with OS bullseye
[19:15:25] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:15:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:16:26] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282#11801494 (Andrew) Open→Resolved
[19:17:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[19:17:51] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1269046 (https://phabricator.wikimedia.org/T128546)
[19:20:36] <wikibugs>	 (PS4) Dzahn: zuul::base: ensure /var/ssh/zuul exists [puppet] - https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938)
[19:22:37] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[19:23:53] <wikibugs>	 (CR) Bearloga: [C:+1] EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: Sergio Gimeno)
[19:26:03] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1260847/8397/zuul1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[19:27:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[19:28:42] <rzl>	 ^ still expected, same story
[19:29:14] <wikibugs>	 (CR) Dzahn: [C:+1] gitlab: add feature flag for rsyslog input and disable in devtools [puppet] - https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: Jelto)
[19:35:13] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[19:35:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:36:39] <rzl>	 I wonder why that keeps firing right *after* the release finishes, something funny about the timing
[19:36:49] <rzl>	 it'll self-resolve again though
[19:36:55] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudlb haproxy: allow configuring health port for tcp services [puppet] - https://gerrit.wikimedia.org/r/1260135 (owner: Andrew Bogott)
[19:40:33] <wikibugs>	 (PS1) Ladsgroup: Use envoy for swift inside mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872)
[19:40:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:42:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2000). nyaa~
[20:00:05] <jouncebot>	 toyofuku and toyofuku: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:41] <wikibugs>	 (CR) CDanis: [C:+1] Use envoy for swift inside mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[20:01:02] <jan_drewniak>	 oh wow looks like it's just you and me toyofuku
[20:01:08] <toyofuku>	 haha perfect
[20:01:18] <toyofuku>	 We can do rock paper scissors for who deploys the config patches?
[20:02:30] <jan_drewniak>	 scissors
[20:03:09] <toyofuku>	 rock (:<
[20:03:31] <rzl>	 :o
[20:03:47] <jan_drewniak>	 😂
[20:03:56] <cdanis>	 🪨
[20:06:11] <jan_drewniak>	 toyofuku: lol, ok you win. Both of these can be done at the same time btw, I can verify the portal one and then I'll run a purge command after it's synced
[20:06:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) (owner: Jdrewniak)
[20:06:27] <toyofuku>	 oh oops I just started the one config one
[20:06:32] <toyofuku>	 I can do the other one after if you'd like
[20:06:48] <toyofuku>	 I'm also in eng enclave ftr
[20:07:01] <wikibugs>	 (PS1) Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938)
[20:07:06] <jan_drewniak>	 np! I
[20:07:07] <wikibugs>	 (Merged) jenkins-bot: Disable extension:WP25EasterEggs from Wikipedias. [mediawiki-config] - https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) (owner: Jdrewniak)
[20:07:33] <logmsgbot>	 !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1269036|Disable extension:WP25EasterEggs from Wikipedias. (T422548)]]
[20:07:34] <wikibugs>	 (CR) CI reject: [V:-1] zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[20:07:36] <stashbot>	 T422548: Deployment: Disable the config flag for extension:WP25EasterEggs - https://phabricator.wikimedia.org/T422548
[20:09:26] <logmsgbot>	 !log toyofuku@deploy1003 jdrewniak, toyofuku: Backport for [[gerrit:1269036|Disable extension:WP25EasterEggs from Wikipedias. (T422548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:10:01] <Amir1>	 when you're done, please ping me. I have a fun patch to push
[20:10:27] <toyofuku>	 Verifying on testservers
[20:10:37] <toyofuku>	 bye bye babyglobe ):
[20:12:03] <jan_drewniak>	 yup. Looks like enwiki disabled it already, but on mwdebug the config page is now gone too! (as expected)
[20:12:03] <jan_drewniak>	  https://en.wikipedia.org/wiki/Special:CommunityConfiguration/WP25EasterEggs 
[20:12:25] <toyofuku>	 yeah I was about to say
[20:12:31] <toyofuku>	 I'm being gaslit by community config
[20:12:51] <toyofuku>	 luckily I speak other languages
[20:13:08] <logmsgbot>	 !log toyofuku@deploy1003 jdrewniak, toyofuku: Continuing with sync
[20:13:17] <toyofuku>	 looks good, moving on
[20:15:31] <wikibugs>	 (PS2) Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938)
[20:17:00] <logmsgbot>	 !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269036|Disable extension:WP25EasterEggs from Wikipedias. (T422548)]] (duration: 09m 27s)
[20:17:03] <stashbot>	 T422548: Deployment: Disable the config flag for extension:WP25EasterEggs - https://phabricator.wikimedia.org/T422548
[20:18:35] <wikibugs>	 (CR) Dzahn: [V:-1] "https://puppet-compiler.wmflabs.org/output/1269053/8398/zuul1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[20:18:53] <wikibugs>	 SRE-swift-storage, Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11801663 (Jeff_G) Another file seems damaged, https://commons.wikimedia.org/wiki/File:Ciclo_de_vida_de_Daphnia_magna_(pulga_de_agua)-es.svg (both versions show "Thumbnail for version as of". For the origina...
[20:19:18] <jan_drewniak>	 toyofuku: looks like that's done. Going to start the portal deploy now
[20:19:36] <toyofuku>	 Thank you!  Sorry switching gears to focus on eng enclave
[20:19:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269046 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[20:20:44] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1269046 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[20:21:01] <jan_drewniak>	 toyofuku: thanks for deploying! and good game of rock paper scissors, I'll get you next time :P 
[20:21:11] <logmsgbot>	 !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]]
[20:21:14] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[20:23:09] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:24:22] <wikibugs>	 (PS1) Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664)
[20:24:23] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Continuing with sync
[20:25:01] <wikibugs>	 (CR) CI reject: [V:-1] smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: Cwhite)
[20:26:21] <jan_drewniak>	 oy!
[20:26:21] <jan_drewniak>	 ```
[20:26:21] <jan_drewniak>	 20:24:23 Started sync-canaries-k8s
[20:26:21] <jan_drewniak>	 20:24:26 K8s deployment progress:   0% (ok: 0; fail: 0; left: 60)
[20:26:21] <jan_drewniak>	 20:24:32 Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
[20:26:21] <jan_drewniak>	 20:24:32 Stdout/stderr follows:
[20:26:21] <jan_drewniak>	 20:24:32 skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-api-ext/codfw.yaml"
[20:26:22] <jan_drewniak>	 ```
[20:27:39] <jan_drewniak>	 oh no, deployment error!
[20:28:22] <jan_drewniak>	 https://www.irccloud.com/pastebin/e6QkR3Gr/
[20:28:41] <dancy>	 https://spiderpig.wikimedia.org/jobs/1719 for those with access
[20:30:06] <jan_drewniak>	 seems like it rolled back fine, but no idea what that error was about.
[20:30:25] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service aqs1024-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:30:27] <wikibugs>	 (PS2) Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664)
[20:30:41] <dancy>	 swfrench-wmf: Are you around?
[20:30:55] <rzl>	 that "skipping missing values file" line is actually fine, your real error is from deeper in
[20:30:56] <rzl>	 Error: no cached repo found. (try 'helm repo update'): error loading /var/cache/helm/repository/wmf-stable-index.yaml: empty index.yaml file
[20:31:00] <rzl>	 and that *is* weird
[20:31:03] <dancy>	 Ah, rzl to the rescue!
[20:31:04] <wikibugs>	 (CR) CI reject: [V:-1] smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: Cwhite)
[20:31:48] <swfrench-wmf>	 o/
[20:31:52] <swfrench-wmf>	 puppet race?
[20:32:00] <dancy>	 I was wondering that.
[20:32:00] <rzl>	 that's what I was betting on, yeah
[20:32:00] <wikibugs>	 (PS3) Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664)
[20:32:08] <dancy>	 This is probably a "just retry" situation.
[20:32:15] <rzl>	 especially if the rollback worked, I bet a rollforward does too
[20:32:20] * swfrench-wmf nods
[20:32:26] <swfrench-wmf>	 I'll double-check the timing
[20:33:03] <rzl>	 jan_drewniak: try it again, and this time believe -- with your whole heart -- that your patch is worthy
[20:33:10] <wikibugs>	 (PS4) Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664)
[20:33:12] <dancy>	 haha
[20:33:23] <wikibugs>	 (CR) Zabe: Use envoy for swift inside mediawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[20:33:25] <jan_drewniak>	 alright here goes! 
[20:34:00] <logmsgbot>	 !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]]
[20:34:03] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[20:34:05] <swfrench-wmf>	 puppet run from 20:19:14 to 20:24:50
[20:34:09] <swfrench-wmf>	 so yeah, that tracks
[20:34:15] <rzl>	 ah yep
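The timing check swfrench-wmf describes above can be sketched in a few lines of Python. This is only an illustration: it uses the three timestamps quoted in the channel (puppet run from 20:19:14 to 20:24:50, helmfile failure logged at 20:24:32), the function names are hypothetical, and all times are assumed to fall on the same UTC day.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"  # time-of-day only; assumes every timestamp is on the same day


def parse(ts: str) -> datetime:
    """Parse an HH:MM:SS string into a datetime (date part is a dummy)."""
    return datetime.strptime(ts, FMT)


def in_window(event: str, start: str, end: str) -> bool:
    """Return True if `event` falls inside the closed interval [start, end]."""
    return parse(start) <= parse(event) <= parse(end)


def window_length(start: str, end: str) -> timedelta:
    """Return how long the interval from `start` to `end` lasts."""
    return parse(end) - parse(start)


# Values copied from the log above.
puppet_start, puppet_end = "20:19:14", "20:24:50"
helmfile_failure = "20:24:32"

print(in_window(helmfile_failure, puppet_start, puppet_end))  # True
print(window_length(puppet_start, puppet_end))                # 0:05:36
```

The overlap is consistent with the puppet-race guess but does not prove it; as rzl notes next, the window in which the index file is actually unusable is presumably much shorter than the whole run.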
[20:34:55] <rzl>	 the window for this race is a lot shorter than that 5½ minutes obviously, but we could still probably be cleverer about this if we needed to
[20:35:20] <swfrench-wmf>	 it does feel odd that this has now happened twice in the past month or so ... but I also don't want to draw connections between sparse data points
[20:35:49] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:35:49] <rzl>	 oh has it? hm, yeah
[20:36:24] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Continuing with sync
[20:36:34] <swfrench-wmf>	 I mean "twice" inclusive of this instance of it
[20:36:44] <rzl>	 nod
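One way the deploy tooling could be "cleverer" about this race, as a rough sketch only: wait briefly for the helm repository index to be non-empty before invoking helmfile. Nothing in this log says scap or helmfile does this; `wait_for_nonempty` and its parameters are hypothetical, and the only detail taken from the log is the index path in the error rzl pasted.

```python
import os
import time


def wait_for_nonempty(path: str, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll `path` until it exists with a size greater than zero.

    Returns True as soon as the file is non-empty, or False once
    `timeout` seconds have passed without that happening.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            # The file may be missing for a moment while it is being replaced.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller might run `wait_for_nonempty("/var/cache/helm/repository/wmf-stable-index.yaml", timeout=60)` and only proceed when it returns True. This would only narrow the race rather than remove it: a partially written index would still pass the size check, so having the writer replace the file atomically would be the more robust fix if the puppet-race theory holds.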
[20:37:07] <wikibugs>	 (PS5) Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664)
[20:38:06] <wikibugs>	 (PS6) Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664)
[20:38:35] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: disable unattended upgrades in base image [puppet] - https://gerrit.wikimedia.org/r/1269056 (https://phabricator.wikimedia.org/T422509)
[20:40:14] <logmsgbot>	 !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] (duration: 06m 14s)
[20:40:17] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[20:40:20] <wikibugs>	 (CR) Ladsgroup: Use envoy for swift inside mediawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[20:41:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:44:01] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] Remove unused JWT for bot password temporary config [mediawiki-config] - https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: D3r1ck01)
[20:44:07] <dancy>	  rzl/swfrench-wmf: thanks for the eyeballs
[20:45:15] * swfrench-wmf thumbs up
[20:51:49] <wikibugs>	 (CR) Zabe: Use envoy for swift inside mediawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[20:54:43] <Amir1>	 jouncebot: nowandnext
[20:54:43] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2000)
[20:54:43] <jouncebot>	 In 0 hour(s) and 5 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2100)
[20:55:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[20:56:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[20:57:50] <wikibugs>	 (Merged) jenkins-bot: Use envoy for swift inside mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[20:58:12] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]]
[20:58:15] <stashbot>	 T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872
[20:58:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:58:56] <wikibugs>	 (CR) Majavah: [C:-1] "https://phabricator.wikimedia.org/T422509#11801856" [puppet] - https://gerrit.wikimedia.org/r/1269056 (https://phabricator.wikimedia.org/T422509) (owner: Andrew Bogott)
[21:00:04] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2100)
[21:00:45] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[21:00:55] <Amir1>	 I'll be done really quickly
[21:04:39] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] (duration: 06m 27s)
[21:04:42] <stashbot>	 T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872
[21:09:59] <wikibugs>	 (CR) RLazarus: [C:+2] function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[21:12:13] <wikibugs>	 (Merged) jenkins-bot: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[21:17:25] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:19:05] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[21:20:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[21:20:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 499547320 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:21:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 616 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:24:36] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T422748 (phaultfinder) NEW
[21:26:04] <wikibugs>	 (PS1) Bernard Wang: Enable reading list beta feature for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063
[21:26:54] <wikibugs>	 (CR) CI reject: [V:-1] Enable reading list beta feature for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (owner: Bernard Wang)
[21:27:42] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:36:23] <wikibugs>	 (PS1) RLazarus: Revert "function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext" [deployment-charts] - https://gerrit.wikimedia.org/r/1269064 (https://phabricator.wikimedia.org/T367880)
[21:39:40] <wikibugs>	 (CR) RLazarus: [C:+2] Revert "function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext" [deployment-charts] - https://gerrit.wikimedia.org/r/1269064 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[21:41:56] <wikibugs>	 (Merged) jenkins-bot: Revert "function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext" [deployment-charts] - https://gerrit.wikimedia.org/r/1269064 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[21:45:28] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:45:40] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:46:00] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:47:23] <wikibugs>	 (PS2) Cwhite: add beta-logs pki key [labs/private] - https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516)
[21:47:43] <wikibugs>	 (PS3) Cwhite: initial pki config for beta-logs env [puppet] - https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516)
[21:49:41] <wikibugs>	 (CR) CI reject: [V:-1] initial pki config for beta-logs env [puppet] - https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[21:51:10] <wikibugs>	 (PS4) Cwhite: initial pki config for beta-logs env [puppet] - https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516)
[21:55:48] <wikibugs>	 (PS1) Ladsgroup: Revert "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269067
[21:56:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269067 (owner: Ladsgroup)
[21:57:09] <wikibugs>	 (Merged) jenkins-bot: Revert "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269067 (owner: Ladsgroup)
[21:57:35] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269067|Revert "Use envoy for swift inside mediawiki"]]
[21:57:46] <wikibugs>	 (CR) Aude: Enable reading list beta feature for pilot wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (owner: Bernard Wang)
[21:59:05] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (owner: Bernard Wang)
[21:59:30] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269067|Revert "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2200)
[22:00:35] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[22:04:29] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269067|Revert "Use envoy for swift inside mediawiki"]] (duration: 06m 54s)
[22:04:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T422748#11802135 (phaultfinder)
[22:05:25] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1024-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:10:56] <wikibugs>	 (PS1) RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880)
[22:15:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11802153 (VRiley-WMF) @Jclark-ctr was it able to finish the provisioning? I attempted to do this with ganeti1055, but it wasn't able to finish.   Oddly enough, the...
[22:16:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11802156 (MMigurski-WMF) I have updated my email to a wikimedia.org address, and I requested access to the `wmf` group.  I believe that might be sufficient for my required access,...
[22:19:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11802189 (Jclark-ctr) >>! In T418903#11802153, @VRiley-WMF wrote: > @Jclark-ctr was it able to finish the provisioning? I attempted to do this with ganeti1055, but...
[22:23:23] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 2.165e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[22:25:23] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 1 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[22:42:49] <wikibugs>	 (PS3) Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938)
[22:45:25] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1024-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:59:00] <wikibugs>	 (PS4) Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938)
[23:00:44] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1269053" [puppet] - https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:01:09] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1269053/8400/" [puppet] - https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:17:41] <wikibugs>	 (PS1) Jforrester: wikifunctions: Stop testing the v1 orchestrator endpoint, we're dropping it [deployment-charts] - https://gerrit.wikimedia.org/r/1269072 (https://phabricator.wikimedia.org/T421768)
[23:22:45] <wikibugs>	 (PS1) Dzahn: zuul::executor: add TLS full chain needed for zookeeper config [puppet] - https://gerrit.wikimedia.org/r/1269073 (https://phabricator.wikimedia.org/T421398)
[23:25:41] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1269073/8401/" [puppet] - https://gerrit.wikimedia.org/r/1269073 (https://phabricator.wikimedia.org/T421398) (owner: Dzahn)
[23:28:21] <wikibugs>	 (CR) Jforrester: [C:+1] function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[23:39:21] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:39:47] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1269077
[23:39:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1269077 (owner: TrainBranchBot)
[23:49:56] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1269077 (owner: TrainBranchBot)