[00:29:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:30:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:31:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:34:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:36:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:36:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:39:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:40:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:40:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:41:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:45:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:50:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:51:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:00:04] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1268295 (owner: 10TrainBranchBot) [01:05:25] FIRING: [3x] ProbeDown: Service aqs1023-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:09:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1268698 [01:09:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1268698 (owner: 10TrainBranchBot) [01:15:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:21:57] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1268698 (owner: 
10TrainBranchBot) [02:00:55] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:07] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 12s) [02:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:35:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:41:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:41:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:44:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are 
marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:44:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:45:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:48:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:54:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:56:15] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:59:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:59:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:00:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:02:15] RECOVERY - 
PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:35:25] FIRING: [3x] ProbeDown: Service aqs1023-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:37:08] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11797531 (10RobH) I'll put a more detailed timeline and update tomorrow but as it stands now: * unisys engineer showed up at 10am singapore time * swapped mainboard, damaged the CPU bracket and mainb... [04:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:10:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:08] (03PS1) 10Marostegui: db1152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1268708 (https://phabricator.wikimedia.org/T418561) [05:13:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Maintenance [05:13:43] (03CR) 10Marostegui: [C:03+2] db1152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1268708 (https://phabricator.wikimedia.org/T418561) (owner: 10Marostegui) [05:15:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1152: Reimage [05:15:32] !log 
marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:15:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:15:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1152: Reimage [05:15:59] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1152.eqiad.wmnet with OS trixie [05:20:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:33] (03PS1) 10Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) [05:23:23] (03PS2) 10Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) [05:24:10] (03CR) 10CI reject: [V:04-1] cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: 10Anzx) [05:29:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage [05:33:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage [05:36:08] (03PS3) 10Ayounsi: eqsin routed ganeti: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) [05:36:16] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [05:41:13] (03CR) 10Ayounsi: [C:03+2] eqsin routed ganeti: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [05:43:13] 
(03PS3) 10Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) [05:50:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1152.eqiad.wmnet with OS trixie [05:52:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:53:17] (03PS4) 10Anzx: cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) [05:53:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11797619 (10ayounsi) [05:57:14] (03PS1) 10Marostegui: Revert "db1152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1268832 [05:57:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:57:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1152: After reimage [05:57:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:58:00] (03CR) 10Marostegui: [C:03+2] Revert "db1152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1268832 (owner: 10Marostegui) [05:58:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:58:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool 
(exit_code=0) pool db1152: After reimage [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T0600) [06:11:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:12:00] (03PS1) 10Muehlenhoff: Make ganeti5007 a routed Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1268834 (https://phabricator.wikimedia.org/T421863) [06:15:09] (03PS2) 10Anzx: wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) [06:15:15] PROBLEM - Ensure traffic_manager is running for instance backend on cp6016 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [06:16:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:16:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) (owner: 10Anzx) [06:16:15] RECOVERY - Ensure traffic_manager is running for instance backend on cp6016 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager 
--nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [06:16:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: 10Anzx) [06:40:40] (03PS1) 10Hashar: ci: enhance ci-build-images script [puppet] - 10https://gerrit.wikimedia.org/r/1268594 (https://phabricator.wikimedia.org/T422488) [06:59:25] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11797721 (10MoritzMuehlenhoff) @MMigurski-WMF The developer account needs to be linked to your @wikimedia.org email address. Please log into https://idm.wikimedia.org/ and then unde... [07:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T0700). [07:00:05] WMDE-Fisch and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:10] o/ [07:00:30] o/ [07:01:05] I could self-serve [07:01:40] Starting with my change now. 
[07:01:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [07:04:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: 10WMDE-Fisch) [07:05:49] (03Merged) 10jenkins-bot: Enable sub-references on Czech and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: 10WMDE-Fisch) [07:06:45] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1268514|Enable sub-references on Czech and Italian wiki (T420938)]] [07:06:49] T420938: Deploy Sub-referencing to itwiki and cswiki - https://phabricator.wikimedia.org/T420938 [07:08:42] !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1268514|Enable sub-references on Czech and Italian wiki (T420938)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:11:14] !log wmde-fisch@deploy1003 wmde-fisch: Continuing with sync [07:13:16] 07sre-alert-triage, 06Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422581 (10LSobanski) 03NEW [07:13:21] 07sre-alert-triage, 06Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422582 (10LSobanski) 03NEW [07:14:12] 07sre-alert-triage, 06Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422581#11797758 (10LSobanski) Considering this is a critical alert that has been firing for a month, should it be downgraded or removed? 
[07:14:23] 07sre-alert-triage, 06Quality-and-Test-Engineering-Team: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422582#11797759 (10LSobanski) Considering this is a critical alert that has been firing for a month, should it be downgraded or removed? [07:15:29] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268514|Enable sub-references on Czech and Italian wiki (T420938)]] (duration: 08m 44s) [07:15:32] T420938: Deploy Sub-referencing to itwiki and cswiki - https://phabricator.wikimedia.org/T420938 [07:16:20] I'm done. [07:18:24] I could self-serve as well, but anzx was first. [07:18:45] need someone to deploy for me [07:19:05] !log installing openssl security updates [07:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:06] anzx: I could do your patches as well, but I have no clue what they are doing exactly and there's no +1 from anybody else 🤔 [07:19:12] Ahh [07:21:15] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11797798 (10matej_suchanek) >>! In T421642#11785461, @Xqt wrote: > The problems began on March 25th: > {F74901675} Please (re)attach the file, so that it's visible if i... [07:21:22] (03CR) 10Ayounsi: [C:03+1] Make ganeti5007 a routed Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1268834 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:23:44] WMDE-Fisch , i can schedule it for later if you couldn't deploy [07:24:45] I'm more confident with the Wikimania wiki part. 
I'll do that then you already have half of the job done ;-) [07:24:51] (03CR) 10Ayounsi: "Using RIPE Atlas:" [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [07:25:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) (owner: 10Anzx) [07:25:32] WMDE-Fisch: ok [07:26:13] (03Merged) 10jenkins-bot: wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268833 (https://phabricator.wikimedia.org/T421770) (owner: 10Anzx) [07:26:37] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1268833|wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup (T421770)]] [07:26:41] T421770: wikimaniawiki: add editsemiprotected to extendedconfirmed group - https://phabricator.wikimedia.org/T421770 [07:28:26] !log wmde-fisch@deploy1003 wmde-fisch, anzx: Backport for [[gerrit:1268833|wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup (T421770)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:28:30] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/aux-codfw: maintenance [07:28:31] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/aux-codfw: maintenance [07:28:32] anzx: Want to test anything with that patch? 
[07:29:09] (03CR) 10Arnaudb: [C:03+2] gerrit: disable connection re-use [puppet] - 10https://gerrit.wikimedia.org/r/1268557 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [07:29:15] WMDE-Fisch: looks good to sync [07:29:19] !log wmde-fisch@deploy1003 wmde-fisch, anzx: Continuing with sync [07:33:31] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268833|wikimaniawiki: add editsemiprotected userright to extendedconfirmed usergroup (T421770)]] (duration: 06m 54s) [07:33:35] T421770: wikimaniawiki: add editsemiprotected to extendedconfirmed group - https://phabricator.wikimedia.org/T421770 [07:33:50] anzx: Done. :-) [07:33:56] thank you [07:33:59] (03PS3) 10Majavah: hieradata: service: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) [07:33:59] (03PS3) 10Majavah: O:dumps::distribution::server: Configure as LVS realserver [puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) [07:33:59] (03PS3) 10Majavah: hieradata: Move dumps to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) [07:34:00] (03PS4) 10Majavah: hieradata: Move dumps to production [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) [07:34:03] Krinkle: You can go on then. 
[07:34:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [07:35:22] (03CR) 10Majavah: [C:03+2] hieradata: service: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [07:35:35] (03CR) 10Majavah: [C:03+2] O:dumps::distribution::server: Configure as LVS realserver [puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [07:35:57] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:36:03] (03Merged) 10jenkins-bot: Enable wgTrackMediaRequestProvenance on most group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267437 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [07:36:29] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] [07:36:32] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [07:38:18] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:40:39] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin routed ganeti IPs - ayounsi@cumin1003" [07:40:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin routed ganeti IPs - ayounsi@cumin1003" [07:40:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:40:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:41:54] !log krinkle@deploy1003 krinkle: Continuing with sync [07:46:04] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] (duration: 09m 34s) [07:46:07] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [07:48:14] (03CR) 10Majavah: [C:03+2] "looks good: https://phabricator.wikimedia.org/P90323" [puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [07:48:26] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/aux-codfw: maintenance [07:48:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/aux-codfw: maintenance [07:53:00] (03PS1) 10Slyngshede: CSS: Improve footer on mobile [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1268895 (https://phabricator.wikimedia.org/T422203) [07:54:26] (03PS1) 10Elukey: service: allow k8s-ingress-aux to be depooled [puppet] - 10https://gerrit.wikimedia.org/r/1268896 (https://phabricator.wikimedia.org/T414486) [07:55:04] (03CR) 10JMeybohm: [C:03+1] service: allow 
k8s-ingress-aux to be depooled [puppet] - 10https://gerrit.wikimedia.org/r/1268896 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [07:55:49] (03CR) 10Elukey: [C:03+2] service: allow k8s-ingress-aux to be depooled [puppet] - 10https://gerrit.wikimedia.org/r/1268896 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [07:55:55] (03CR) 10Muehlenhoff: [C:03+2] Make ganeti5007 a routed Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/1268834 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:56:29] elukey: okay to merge your ingress change along now? [07:56:38] +1 [07:57:42] and merged [08:00:05] dancy and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T0800). [08:01:22] (03CR) 10ArielGlenn: "one question, couple typos noted" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (owner: 10Daniel Kinzler) [08:02:37] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/aux-codfw: maintenance [08:03:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/aux-codfw: maintenance [08:04:22] !log elukey@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster aux-codfw: Kubernetes upgrade [08:04:31] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11797895 (10jcrespo) >>! In T419970#11795620, @Jhancock.wm wrote: > @jcrespo would loading the disks from a foreign config be acceptable for... 
[08:06:25] (03CR) 10Elukey: [C:03+2] Add istio 1.24 config for k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267568 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:06:39] (03CR) 10Elukey: [C:03+2] admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:06:55] (03CR) 10Elukey: [C:03+2] Upgrade aux-k8s-codfw to k8s 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:08:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl2003.codfw.wmnet are marked down but pooled: k8s-ingress-aux_30443: Servers aux-k8s-worker2003.codfw.wmnet, aux-k8s-worker2005.codfw.wmnet, aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:08:17] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl2003.codfw.wmnet are marked down but pooled: k8s-ingress-aux_30443: Servers aux-k8s-worker2003.codfw.wmnet, aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2004.codfw.wmnet, aux-k8s-worker2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:08:47] this is me --^ [08:10:25] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti5007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [08:15:06] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) 
[08:15:12] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [08:16:27] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:17:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 03 Jun 2026 06:56:12 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:18:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1268895 (https://phabricator.wikimedia.org/T422203) (owner: 10Slyngshede) [08:19:51] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [08:20:18] elukey@cumin1003 wipe-cluster (PID 3700395) is awaiting input [08:20:46] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [08:22:50] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [08:23:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [08:24:43] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972) [08:24:54] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[08:25:56] (03PS2) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972) [08:26:26] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.8 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268898 (https://phabricator.wikimedia.org/T421972) [08:28:20] (03CR) 10Slyngshede: [V:03+2 C:03+2] CSS: Improve footer on mobile [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1268895 (https://phabricator.wikimedia.org/T422203) (owner: 10Slyngshede) [08:30:55] FIRING: [3x] SystemdUnitFailed: kube-scheduler.service on aux-k8s-ctrl2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:03] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: sync [08:31:17] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: sync [08:31:28] (03PS1) 10Ayounsi: Add PTR includes for eqsin routed ganeti ranges [dns] - 10https://gerrit.wikimedia.org/r/1268899 (https://phabricator.wikimedia.org/T421863) [08:31:47] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/kafka-mirrormaker: sync [08:32:09] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: sync [08:32:34] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: sync [08:32:42] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: sync [08:32:51] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: sync [08:33:03] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: sync [08:33:28] elukey@cumin1003 wipe-cluster (PID 3700395) is 
awaiting input [08:34:03] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:34:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:35:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:35:40] FIRING: [2x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:36:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [08:37:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [08:39:34] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: sync [08:40:09] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: sync [08:40:49] RECOVERY - Check unit status of 
httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:41:36] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster aux-codfw: Kubernetes upgrade [08:42:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [08:44:25] (03CR) 10Muehlenhoff: [C:03+1] "Matches what is in Netbox, looks good" [dns] - 10https://gerrit.wikimedia.org/r/1268899 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:44:34] ayounsi@cumin1003 reimage (PID 3704057) is awaiting input [08:45:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [08:47:33] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/aux-codfw: maintenance [08:47:57] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/aux-codfw: maintenance 
[08:51:00] (03CR) 10Ayounsi: [C:03+2] Add PTR includes for eqsin routed ganeti ranges [dns] - 10https://gerrit.wikimedia.org/r/1268899 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:51:01] (03CR) 10Tiziano Fogli: [C:03+2] thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [08:51:25] !log ayounsi@dns1004 START - running authdns-update [08:51:49] (03PS1) 10Elukey: service: allow k8s-ingress-aux-rw to be active/active [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) [08:52:46] !log ayounsi@dns1004 END - running authdns-update [08:53:03] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie [08:53:09] !log taavi@cumin1003 conftool action : set/pooled=yes; selector: name=clouddumps1001.wikimedia.org [08:54:14] (03CR) 10Majavah: [C:03+2] cr-cloud-vrf: Remove clouddumps NAT exemption rule [homer/public] - 10https://gerrit.wikimedia.org/r/1268516 (owner: 10Majavah) [08:56:40] (03CR) 10Elukey: tox: rework venvs to speed up local and CI timings (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:56:51] (03PS1) 10Gkyziridis: ml-services: Configure autoscaling for rr-multilingual on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268903 (https://phabricator.wikimedia.org/T415892) [08:59:48] (03CR) 10Gkyziridis: [C:03+2] ml-services: Configure autoscaling for rr-multilingual on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268903 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [09:00:01] !log remove unused cloud-vrf clouddumps cr firewall rule https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1268516 [09:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:01] (03Merged) 10jenkins-bot: ml-services: Configure autoscaling for rr-multilingual on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268903 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [09:12:24] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11798075 (10Xqt) >>! In T421642#11797798, @matej_suchanek wrote: > > Please (re)attach the file, so that it's visible if it's important ([[ https://www.mediawiki.org/wi... [09:19:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [09:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:49] (03CR) 10JMeybohm: "LGTM but could you please add a note about this to https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_control-planes ?" 
[dns] - 10https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [09:24:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [09:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 17.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:27:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [09:41:35] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie [09:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:47:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [09:50:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [09:54:21] !log upgrading haproxy to version 3.2.15 on magru,drmrs,ulsfo (T421402) [09:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:24] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [09:55:15] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-magru - 3.2.15 upgrade (T421402) [09:55:19] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-drmrs - 3.2.15 upgrade (T421402) [09:55:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [09:55:23] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo - 3.2.15 upgrade (T421402) [09:56:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [09:58:04] (03CR) 10Giuseppe Lavagetto: [C:04-1] "This contradicts both what we do at the edge and our policies - if someone doesn't change the default UA of program they're using, it make" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) (owner: 10Daniel Kinzler) [09:58:55] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1000) [10:00:05] dues: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [10:02:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1011.eqiad.wmnet with OS bullseye [10:02:38] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798210 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1011.eq... 
[10:02:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-fe1011 [10:03:06] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [10:08:46] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1011 - mvernon@cumin2002" [10:08:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1011 - mvernon@cumin2002" [10:08:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:08:52] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-fe1011.eqiad.wmnet 182.32.64.10.in-addr.arpa 2.8.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:08:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1011.eqiad.wmnet 182.32.64.10.in-addr.arpa 2.8.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:08:56] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1011 [10:11:03] There's a pending restbase deploy that I will steal this window for if it's not being used [10:11:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1011 [10:11:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1011 [10:11:19] it'll only affect mathoid so it's very low risk if other things move in parallel [10:12:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:12:37] !log hnowlan@deploy1003 Started deploy [restbase/deploy@dcc15be]: Add urwikisource T415975 [10:12:40] T415975: Add urwikisource to RESTBase - https://phabricator.wikimedia.org/T415975 [10:14:08] !log hnowlan@deploy1003 Finished deploy [restbase/deploy@dcc15be]: Add urwikisource 
T415975 (duration: 01m 31s) [10:15:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:18:30] (all done) [10:25:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage [10:29:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage [10:30:03] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596 (10MoritzMuehlenhoff) 03NEW [10:30:09] (03PS1) 10Effie Mouzeli: mw-paroid: bump resources and workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) [10:30:09] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11798260 (10MoritzMuehlenhoff) p:05Triage→03High [10:30:20] (03CR) 10Slyngshede: [C:03+1] Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [10:35:52] (03CR) 10Clément Goubert: [C:03+1] mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [10:36:54] ty hnowlan :) [10:37:55] (03CR) 10Blake: [C:03+2] mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [10:40:02] (03Merged) 10jenkins-bot: mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [10:41:55] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:42:01] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1268058 (owner: 10PipelineBot) [10:42:20] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:42:21] !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:42:41] !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:44:00] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268058 (owner: 10PipelineBot) [10:48:46] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie [10:52:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1011.eqiad.wmnet with OS bullseye [10:52:18] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798395 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1011.eqiad.... [10:52:38] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[1009-1010,1012-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [10:55:55] RESOLVED: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti5007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1100). 
[11:01:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe[1009-1010,1012-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [11:10:48] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:11:03] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798427 (10MatthewVernon) [11:11:05] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:11:14] !log installing Tomcat security updates [11:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:49] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798430 (10MatthewVernon) A wrinkle here is that ferm doesn't get reloaded on the other swift nodes (presumably because th... 
[11:13:04] (03PS1) 10Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268921 [11:13:16] (03CR) 10Mvolz: [C:03+2] Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268921 (owner: 10Mvolz) [11:14:30] 06SRE, 10Wikimedia-Mailing-lists: Close mailing list editing-team@lists.wikimedia.org - https://phabricator.wikimedia.org/T422562#11798438 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}} [11:15:39] (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268921 (owner: 10Mvolz) [11:15:42] !log installing dpkg security updates [11:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:01] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11798465 (10taavi) >>! In T421719#11798427, @MatthewVernon wrote: > A wrinkle here is that ferm doesn't get reloaded on the... [11:23:32] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-ulsfo - 3.2.15 upgrade (T421402) [11:23:38] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [11:30:24] (03PS1) 10Gkyziridis: ml-services: Deploy autoscaling for rr-mulgilingual on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268923 (https://phabricator.wikimedia.org/T415892) [11:33:03] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy autoscaling for rr-mulgilingual on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268923 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:35:00] (03Merged) 10jenkins-bot: ml-services: Deploy autoscaling for rr-mulgilingual on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268923 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:35:38] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [11:35:49] (03CR) 10KartikMistry: [C:03+2] machinetranslation: Remove networkpolicies for people* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) (owner: 10JMeybohm) [11:37:49] (03Merged) 10jenkins-bot: machinetranslation: Remove networkpolicies for people* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) (owner: 10JMeybohm) [11:38:57] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-magru - 3.2.15 upgrade (T421402) [11:39:00] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [11:39:26] Deploying MinT; Minor changes. 
[11:41:23] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:41:30] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [11:42:14] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-drmrs - 3.2.15 upgrade (T421402) [11:42:44] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:42:52] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:43:33] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:43:38] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:44:28] !log machinetranslation: Remove networkpolicies for people* (T335491) [11:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:31] T335491: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 [11:45:25] FIRING: [24x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:48] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual model on experimental with autoscaling enabled. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268927 (https://phabricator.wikimedia.org/T415892) [11:48:51] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual model on experimental with autoscaling enabled. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268927 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:50:25] FIRING: [24x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:15] (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual model on experimental with autoscaling enabled. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268927 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:53:06] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:07:28] !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp1001.eqiad.wmnet [12:09:23] (03PS1) 10Gkyziridis: fix empty line [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268933 [12:10:50] (03PS1) 10Jelto: gitlab: do not send gitlab logs to journal/syslog [puppet] - 10https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) [12:13:07] (03PS2) 10Arnaudb: gerrit: shorten Envoy upstream idle timeout to 100s [puppet] - 10https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827) [12:13:07] (03CR) 10Arnaudb: "disabling connection reuse at CDN level did not fix `GnuTLS recv error (-54)` happening in CI." 
[puppet] - 10https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [12:13:16] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8390/co" [puppet] - 10https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [12:13:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp1001.eqiad.wmnet [12:14:06] (03PS3) 10Arnaudb: gerrit: shorten Envoy upstream idle timeout to 100s [puppet] - 10https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827) [12:14:14] (03PS2) 10Effie Mouzeli: mw-paroid: bump resources and workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) [12:15:16] !log mszwarc@deploy1003 mwscript-k8s job started: foreachwikiindblist all backfillInterwikiRightsLog.php --remote-wiki metawiki 20260311190000 # T6055 [12:15:19] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [12:15:27] (03PS2) 10Jelto: gitlab: do not send gitlab logs to journal/syslog [puppet] - 10https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) [12:16:48] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8391/co" [puppet] - 10https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [12:18:35] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8392/co" [puppet] - 10https://gerrit.wikimedia.org/r/1268934 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [12:19:12] (03CR) 10Gkyziridis: [C:03+2] fix empty line 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268933 (owner: 10Gkyziridis) [12:19:49] (03CR) 10Clément Goubert: [C:03+1] mw-paroid: bump resources and workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [12:21:23] (03Merged) 10jenkins-bot: fix empty line [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268933 (owner: 10Gkyziridis) [12:23:49] (03CR) 10Arnaudb: [C:03+2] gerrit: shorten Envoy upstream idle timeout to 100s [puppet] - 10https://gerrit.wikimedia.org/r/1268932 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [12:24:14] (03CR) 10Effie Mouzeli: [C:03+2] mw-paroid: bump resources and workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [12:26:15] (03Merged) 10jenkins-bot: mw-paroid: bump resources and workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268914 (https://phabricator.wikimedia.org/T420336) (owner: 10Effie Mouzeli) [12:26:43] (03PS1) 10Gkyziridis: ml-services: Deploy both revertrisk models on experimental. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268936 [12:27:00] !log jiji@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM wikikube-worker-exp2001.codfw.wmnet [12:27:32] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:27:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM wikikube-worker-exp2001.codfw.wmnet [12:28:04] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:28:37] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:29:17] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:29:35] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy both revertrisk models on experimental. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268936 (owner: 10Gkyziridis) [12:31:28] (03Merged) 10jenkins-bot: ml-services: Deploy both revertrisk models on experimental. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268936 (owner: 10Gkyziridis) [12:31:59] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:32:06] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:32:12] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:32:16] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:32:39] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:34:07] 06SRE, 06Infrastructure-Foundations: Update debdeploy to use checkrestart instead of lsof to detect library restarts - https://phabricator.wikimedia.org/T422614 (10MoritzMuehlenhoff) 03NEW [12:34:10] (03PS1) 10Volans: debdeploy: use cumin v6.0.0 new APIs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268937 [12:38:39] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [12:40:11] (03CR) 10Majavah: [C:03+2] hieradata: Move dumps to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [12:40:17] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie [12:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:43:12] !log restarting pybal on lvs1020 [12:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:25] FIRING: [2x] ProbeDown: Service aqs1023-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:20] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:50:35] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 18 connections established with conf1007.eqiad.wmnet:4001 (min=22) https://wikitech.wikimedia.org/wiki/PyBal [12:52:06] (03CR) 10Elukey: [C:03+1] debdeploy: use cumin v6.0.0 new APIs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268937 (owner: 10Volans) [12:53:37] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:53:58] (03PS1) 10Majavah: P:dumps: rsync: Do not use LOAD_BALANCER_HEALTH_CHECKS [puppet] - 10https://gerrit.wikimedia.org/r/1268942 [12:54:52] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1018 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Majavah adding dumps-lb - The acknowledgement expires at: 2026-04-09 14:54:38. https://wikitech.wikimedia.org/wiki/PyBal [12:54:52] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 18 connections established with conf1007.eqiad.wmnet:4001 (min=22) Majavah adding dumps-lb - The acknowledgement expires at: 2026-04-09 14:54:38. https://wikitech.wikimedia.org/wiki/PyBal [12:54:55] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: 10Anzx) [12:55:24] cscott: question about the change you scheduled for deployment, just out of interest – is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1268679 not blocked on the same issue as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1268680 ? [12:56:05] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Majavah dumps-lb - The acknowledgement expires at: 2026-04-09 13:55:54. 
https://wikitech.wikimedia.org/wiki/PyBal [12:56:05] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - dumps-lb_873: Servers clouddumps1001.wikimedia.org are marked down but pooled: dumps-lb6_873: Servers clouddumps1001.wikimedia.org are marked down but pooled Majavah dumps-lb - The acknowledgement expires at: 2026-04-09 13:55:54. https://wikitech.wikimedia.org/wiki/PyBal [12:56:19] (03CR) 10Majavah: [C:03+2] P:dumps: rsync: Do not use LOAD_BALANCER_HEALTH_CHECKS [puppet] - 10https://gerrit.wikimedia.org/r/1268942 (owner: 10Majavah) [12:56:51] Lucas_WMDE: eswiki doesn't use flagged revs to the same degree as dewiki, as far as I know [12:57:10] ah, I missed the FlaggedRevs connection [12:57:11] thanks! [12:58:07] Also my fault for not clarifying that there were two parts to that bug: flagged revs and an issue with oldids and the revision cache, and I backported the revision cache fix to WMF.22 yesterday [12:59:18] The flagged revs fix is in wmf.23 but it was a little too complicated for a backport, and for cache reasons we'd actually like to do the deploy in 2 steps anyway [13:00:01] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:01:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268937 (owner: 10Volans) [13:01:59] (03CR) 10Volans: [V:03+2 C:03+2] debdeploy: use cumin v6.0.0 new APIs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268937 (owner: 10Volans) [13:02:54] jouncebot: now [13:02:54] For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1300) [13:03:18] Here [13:03:22] I can spiderpig [13:03:27] o/ [13:04:24] anzx: do you want to go first? 
[13:04:51] !log restarting pybal on lvs1018 [13:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:19] cscott: need someone to deploy mine, please go ahead [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:35] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 22 connections established with conf1007.eqiad.wmnet:4001 (min=22) https://wikitech.wikimedia.org/wiki/PyBal [13:06:15] Ok [13:07:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268679 (https://phabricator.wikimedia.org/T422524) (owner: 10C. Scott Ananian) [13:07:43] o/ [13:08:00] sorry, got so distracted that I missed the start of the actual window [13:08:06] I can deploy for anzx once cscott is done :) [13:08:07] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268679 (https://phabricator.wikimedia.org/T422524) (owner: 10C. 
Scott Ananian) [13:08:36] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1268679|Turn on Parsoid Read Views for eswiki (T422524)]] [13:08:37] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:08:39] T422524: Parsoid Read Views to deploy ~2026-04-07 - https://phabricator.wikimedia.org/T422524 [13:09:28] (03PS1) 10Jelto: gitlab: add feature flag for rsyslog input and disable in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) [13:09:35] o/ I have a patch that needs backporting and deploying [13:09:41] I'll get the backports ready [13:09:57] (03PS1) 10Phuedx: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) [13:10:19] (03PS1) 10Phuedx: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) [13:10:34] !log cscott@deploy1003 cscott: Backport for [[gerrit:1268679|Turn on Parsoid Read Views for eswiki (T422524)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:10:41] (03PS5) 10Majavah: hieradata: Move dumps to production [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) [13:11:47] !log cscott@deploy1003 cscott: Continuing with sync [13:12:04] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8393/console" [puppet] - 10https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [13:14:23] (03CR) 10Fabfur: [C:03+1] hieradata: Move dumps to production [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:14:44] (03CR) 10Majavah: [C:03+2] hieradata: Move dumps to production [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:15:42] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268679|Turn on Parsoid Read Views for eswiki (T422524)]] (duration: 07m 06s) [13:15:46] T422524: Parsoid Read Views to deploy ~2026-04-07 - https://phabricator.wikimedia.org/T422524 [13:15:59] ok, all done. [13:16:05] thanks! 
I’ll continue [13:16:08] over to you, Lucas_WMDE and anzx [13:16:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: 10Anzx) [13:17:58] hmph, zuul/CI is quite busy [13:18:05] hasn’t even started the gate-and-submit jobs yet [13:18:57] (03CR) 10CI reject: [V:04-1] PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:19:24] someone seems to just have pushed a patch with a rather long dependency chain [13:20:48] spurious failures aren't helping, since that's causing jenkins to invalidate all its work on the gate-and-submit pipeline and start over [13:20:51] (03CR) 10Phuedx: "Recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:22:02] (03Merged) 10jenkins-bot: cswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268820 (https://phabricator.wikimedia.org/T422520) (owner: 10Anzx) [13:22:24] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1268820|cswiki: lift IP cap for workshop (T422520)]] [13:22:27] T422520: Lift IP cap on 2026-04-13 for Students Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T422520 [13:24:11] (03CR) 10Arnaudb: [C:03+1] "small nit inline, lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [13:24:17] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1268820|cswiki: lift IP cap for workshop (T422520)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:24:33] anzx: anything you want to test on mwdebug? [13:24:46] nothing to test [13:25:02] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync [13:25:04] sounds good [13:25:35] phuedx: are you adding your backports to the deployment calendar btw? [13:25:41] Will do [13:25:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:26:01] thanks :) [13:26:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:26:17] do you want to deploy them yourself (once the current scap is done) or shall I? [13:26:27] !log upgrade debdeploy-server on cumin2002 to 0.0.99.14-1+deb12u1+exp1 (temporary build with Cumin 6 compat before we have Cumin 6 universally) [13:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:46] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268820|cswiki: lift IP cap for workshop (T422520)]] (duration: 06m 22s) [13:28:50] T422520: Lift IP cap on 2026-04-13 for Students Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T422520 [13:28:59] phuedx: over to you! 
[13:29:11] (03PS1) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) [13:30:07] (03PS1) 10Majavah: dumps: web: Add header for host that served the request [puppet] - 10https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) [13:30:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:30:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:30:41] or would you like me to deploy? [13:30:42] ah ok [13:31:47] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie [13:32:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8395/co" [puppet] - 10https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:32:21] Thanks Lucas_WMDE [13:32:51] (03Merged) 10jenkins-bot: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268948 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:32:53] (03Merged) 10jenkins-bot: PHP SDK: Measure known experiments correctly [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268947 (https://phabricator.wikimedia.org/T422112) (owner: 10Phuedx) [13:33:22] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1268948|PHP SDK: Measure known experiments correctly (T422112)]], [[gerrit:1268947|PHP 
SDK: Measure known experiments correctly (T422112)]] [13:33:25] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [13:33:43] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-scholarly,name=eqiad [13:34:17] (03PS1) 10Majavah: wikimedia.org: Send dumps-rsync to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268954 (https://phabricator.wikimedia.org/T422040) [13:34:19] (03PS1) 10Majavah: wikimedia.org: Send dumps to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) [13:35:16] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1268948|PHP SDK: Measure known experiments correctly (T422112)]], [[gerrit:1268947|PHP SDK: Measure known experiments correctly (T422112)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:36:26] (03CR) 10Atsuko: [C:04-1] "this is a preliminary diff, no review needed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [13:36:37] Looking now [13:37:14] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11799464 (10MoritzMuehlenhoff) [13:37:30] I did a quick browse of a couple of wikis and saw no errors/warnings in the logs. Continuing [13:37:36] !log phuedx@deploy1003 phuedx: Continuing with sync [13:41:21] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268948|PHP SDK: Measure known experiments correctly (T422112)]], [[gerrit:1268947|PHP SDK: Measure known experiments correctly (T422112)]] (duration: 07m 58s) [13:41:24] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [13:42:03] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual and langugage-agnostic in experimental staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268963 (https://phabricator.wikimedia.org/T415892) [13:42:11] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual and langugage-agnostic in experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268963 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [13:42:51] I think that’s it! [13:42:56] !log UTC afternoon backport+config window done [13:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:02] * phuedx watches the logs [13:43:21] Looking good at the moment [13:44:16] (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual and langugage-agnostic in experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268963 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [13:45:39] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:46:22] (03CR) 10Tiziano Fogli: [C:03+1] "Gotcha — we're talking about 11 metrics (6 envoy_cluster_update + 5 envoy_dns) across ~1800 scraping-job targets, resulting in roughly 20k" [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: 10JMeybohm) [13:51:09] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: add feature flag for rsyslog input and disable in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [13:51:26] (03PS1) 10Volans: debdeploy: fix typo in printed message [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268964 [13:52:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268964 (owner: 10Volans) [13:52:31] (03CR) 10Volans: [V:03+2 C:03+2] debdeploy: fix typo in printed message [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1268964 (owner: 10Volans) [13:53:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API: 
Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11799697 (10DPogorzelski-WMF) a:05Jclark-ctr→03klausman [13:55:41] (03PS1) 10Sergio Gimeno: EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) [13:59:38] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-04-01-092119 to 2026-04-06-224243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268966 (https://phabricator.wikimedia.org/T421815) [13:59:40] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-31-162258 to 2026-04-07-234729 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268967 (https://phabricator.wikimedia.org/T407903) [14:00:43] (03CR) 10Phuedx: [C:03+1] EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno) [14:02:08] (03PS1) 10Brouberol: deployment_server: remove un-used opensearch-test-codfw kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1268968 [14:02:50] (03PS1) 10Clément Goubert: data.yaml: cgoubert: Replace non-FIDO key with backup [puppet] - 10https://gerrit.wikimedia.org/r/1268970 [14:03:17] (03PS1) 10Atsuko: atsuko: backup Yubikey and krb [puppet] - 10https://gerrit.wikimedia.org/r/1268972 [14:03:19] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:04:10] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[14:05:33] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2026-04-01-092119 to 2026-04-06-224243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268966 (https://phabricator.wikimedia.org/T421815) (owner: 10Jforrester) [14:07:05] (03CR) 10Bking: [C:03+1] deployment_server: remove un-used opensearch-test-codfw kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1268968 (owner: 10Brouberol) [14:07:24] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:07:36] (03CR) 10Brouberol: [C:03+1] "Yubikey pubkey validated out of band, and +1 on the kerberos addition" [puppet] - 10https://gerrit.wikimedia.org/r/1268972 (owner: 10Atsuko) [14:07:38] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:07:47] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-01-092119 to 2026-04-06-224243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268966 (https://phabricator.wikimedia.org/T421815) (owner: 10Jforrester) [14:08:30] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:09:05] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:17] (03CR) 10Atsuko: [C:03+2] atsuko: backup Yubikey and krb [puppet] - 10https://gerrit.wikimedia.org/r/1268972 (owner: 10Atsuko) [14:10:19] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:10:39] (03CR) 10Atsuko: [C:03+2] "merging" [puppet] - 10https://gerrit.wikimedia.org/r/1268972 (owner: 10Atsuko) [14:11:04] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:11:14] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:11:54] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:13:35] (03CR) 
10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-31-162258 to 2026-04-07-234729 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268967 (https://phabricator.wikimedia.org/T407903) (owner: 10Jforrester) [14:15:55] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-31-162258 to 2026-04-07-234729 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268967 (https://phabricator.wikimedia.org/T407903) (owner: 10Jforrester) [14:16:32] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:17:13] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:17:58] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:18:34] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:18:45] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:19:16] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:51] (03PS2) 10Majavah: dumps: web: Add header for host that served the request [puppet] - 10https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) [14:19:51] (03PS1) 10Majavah: hieradata: Fix dumps http probe [puppet] - 10https://gerrit.wikimedia.org/r/1268978 [14:19:51] (03PS1) 10Majavah: hieradata: Enable paging for dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268979 [14:20:17] (03PS2) 10Majavah: hieradata: Fix dumps http probe [puppet] - 10https://gerrit.wikimedia.org/r/1268978 (https://phabricator.wikimedia.org/T422040) [14:20:19] (03PS3) 10Majavah: dumps: web: Add header for host that served the request [puppet] - 10https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) [14:20:19] (03PS2) 10Majavah: hieradata: Enable paging for dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268979 [14:22:01] (03CR) 10Brouberol: [C:03+2] 
deployment_server: remove un-used opensearch-test-codfw kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1268968 (owner: 10Brouberol) [14:24:06] (03PS1) 10Majavah: Revert "P:toolforge::prometheus: Disable istio-gateway scrape for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268981 (https://phabricator.wikimedia.org/T421386) [14:25:25] RESOLVED: ProbeDown: Service aqs1023-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1023-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1430) [14:31:24] (03CR) 10Filippo Giunchedi: [C:03+1] wikimedia.org: Send dumps-rsync to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268954 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:31:34] (03CR) 10Filippo Giunchedi: [C:03+1] wikimedia.org: Send dumps to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:31:47] (03CR) 10Filippo Giunchedi: [C:03+1] dumps: web: Add header for host that served the request [puppet] - 10https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:31:54] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:32:09] !log upgrading eqsin to haproxy 3.2 (T421402) [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:12] T421402: Upgrade HAProxy to version 3.2 - 
https://phabricator.wikimedia.org/T421402 [14:33:49] (03CR) 10Scott French: [C:03+2] wikikube: Temporarily double coredns replicas (12) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [14:34:45] (03CR) 10Filippo Giunchedi: [C:03+1] hieradata: Fix dumps http probe [puppet] - 10https://gerrit.wikimedia.org/r/1268978 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:35:48] (03CR) 10Majavah: [C:03+2] hieradata: Fix dumps http probe [puppet] - 10https://gerrit.wikimedia.org/r/1268978 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:35:59] (03CR) 10Majavah: [C:03+2] dumps: web: Add header for host that served the request [puppet] - 10https://gerrit.wikimedia.org/r/1268952 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:36:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [14:37:10] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin - 3.2 upgrade (T421402) [14:37:13] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [14:37:25] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin - 3.2 upgrade (T421402) [14:38:31] (03PS1) 10Majavah: dumps: web: Remove plaintext HTTP server [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) [14:39:35] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8396/co" [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) (owner: 10Majavah) [14:39:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [14:40:25] (03PS3) 10Fabfur: hiera: upgrade haproxy to 
version 3.2 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) [14:40:30] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:40:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:40:55] (03CR) 10Majavah: [C:03+2] wikimedia.org: Send dumps-rsync to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268954 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [14:41:08] !log taavi@dns1004 START - running authdns-update [14:41:31] (03Merged) 10jenkins-bot: wikikube: Temporarily double coredns replicas (12) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [14:41:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1268970 (owner: 10Clément Goubert) [14:42:27] !log taavi@dns1004 END - running authdns-update [14:42:29] (03CR) 10Clément Goubert: [C:03+2] data.yaml: cgoubert: Replace non-FIDO key with backup [puppet] - 10https://gerrit.wikimedia.org/r/1268970 (owner: 10Clément Goubert) [14:46:20] (03CR) 10Elukey: "Tested in https://phabricator.wikimedia.org/T420993#11799738" [puppet] - 10https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:47:59] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[14:48:56] !log serve dumps rsync traffic via new LVS service T422040 [14:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] T422040: Migrate clouddumps https/rsync interfaces behind LVS - https://phabricator.wikimedia.org/T422040 [14:49:43] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:53:30] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:56:16] (03PS1) 10Elukey: ipmi: allow to run commands as another user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1268997 (https://phabricator.wikimedia.org/T418929) [14:57:25] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:58:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:58:43] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:00:03] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1268997 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [15:00:47] !log derick@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=zhwiki --logwiki=metawiki 'Mr Kazi Tuhin' KaziHasanTuhin # T422677 [15:00:50] T422677: Unblock stuck global rename of KaziHasanTuhin - https://phabricator.wikimedia.org/T422677 [15:05:22] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1081.eqiad.wmnet with OS bullseye [15:05:53] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1081 [15:06:11] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:07:57] (03CR) 10Krinkle: [C:03+1] Drop 1.5x logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: 10Pppery) [15:10:07] !log bking@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1081 - bking@cumin2002" [15:10:12] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1081 - bking@cumin2002" [15:10:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:10:13] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1081.eqiad.wmnet 166.32.64.10.in-addr.arpa 6.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:10:16] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1081.eqiad.wmnet 166.32.64.10.in-addr.arpa 6.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:10:18] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1081 [15:11:29] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1081 [15:11:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1081 [15:13:03] (03PS1) 10Hnowlan: admin: add backup yubikey for hnowlan [puppet] - 10https://gerrit.wikimedia.org/r/1268999 [15:14:23] (03PS2) 10Hnowlan: admin: add backup yubikey for hnowlan, remove legacy key [puppet] - 10https://gerrit.wikimedia.org/r/1268999 [15:14:40] (03CR) 10Muehlenhoff: [C:03+1] "Indeed, and that also happens fairly often any (e.g. when one of the base sets get modified, which happens often)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [15:16:23] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:16:27] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:16:30] !log andrew@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcephmon2004-dev.codfw.wmnet [15:16:32] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:16:58] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:17:00] (03CR) 10Majavah: [C:03+1] nftables: cleanup tests [puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [15:17:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:17:09] !log sukhe@lvs1020:~$ sudo systemctl restart pybal.service [15:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:22] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:17:29] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:17:58] (03CR) 10Bking: [C:03+2] bking: add some helpers to dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1268672 (owner: 10Bking) [15:17:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:18:04] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:18:41] !log javiermonton@deploy1003 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:18:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:19:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:19:10] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:19:17] (03CR) 10Majavah: "most of the failures listed on https://puppet-compiler.wmflabs.org/output/1211651/6231/, so https://puppet-compiler.wmflabs.org/output/121" [puppet] - 10https://gerrit.wikimedia.org/r/1266205 (owner: 10Majavah) [15:19:54] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1087.eqiad.wmnet with OS bullseye [15:20:27] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1087 [15:20:49] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:26:03] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [15:26:17] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1087 - bking@cumin2002" [15:27:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1081.eqiad.wmnet with reason: host reimage [15:27:26] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1087 - bking@cumin2002" [15:27:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:27] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1087.eqiad.wmnet 174.32.64.10.in-addr.arpa 4.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:27:30] 
!log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1087.eqiad.wmnet 174.32.64.10.in-addr.arpa 4.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:27:31] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1087 [15:28:09] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1087 [15:28:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1087 [15:28:13] (03CR) 10CDobbins: "`" [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [15:28:29] (03CR) 10Elukey: [C:03+2] ipmi: allow to run commands as another user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1268997 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [15:28:46] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:49] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon2004-dev.codfw.wmnet [15:29:18] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin - 3.2 upgrade (T421402) [15:29:21] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [15:30:57] (03CR) 10Cathal Mooney: "Agreed I don't think based on the above or the data Chris shared we can justify sending all of RE to drmrs. 
If it had been close perhaps " [dns] - https://gerrit.wikimedia.org/r/1267042 (owner: Ayounsi) [15:32:13] ops-codfw, DC-Ops, decommission-hardware: decommission cloudcephmon2004-dev - https://phabricator.wikimedia.org/T422437#11800245 (Andrew) a:Andrew→None [15:35:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1081.eqiad.wmnet with reason: host reimage [15:35:30] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1268999 (owner: Hnowlan) [15:36:45] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin - 3.2 upgrade (T421402) [15:36:48] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [15:39:06] (CR) Fabfur: [C:+2] hiera: upgrade haproxy to version 3.2 on codfw [puppet] - https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur) [15:39:13] (CR) Hnowlan: [C:+2] admin: add backup yubikey for hnowlan, remove legacy key [puppet] - https://gerrit.wikimedia.org/r/1268999 (owner: Hnowlan) [15:39:41] fabfur: think I caught your change, okay to merge I assume? 
[15:39:53] yep thanks [15:40:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:41:51] !log upgrading codfw to haproxy 3.2 (T421402) [15:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:58] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [15:42:06] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw - 3.2 upgrade (T421402) [15:42:12] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw - 3.2 upgrade (T421402) [15:43:54] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1087.eqiad.wmnet with reason: host reimage [15:48:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [15:49:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1087.eqiad.wmnet with reason: host reimage [15:52:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS trixie [15:52:28] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1023.eqiad.wmnet [15:52:29] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1023.eqiad.wmnet [15:55:57] (03PS1) 10Elukey: sre.network: add workaround for dry-run in run_junos_commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1269011 [16:00:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1081.eqiad.wmnet with OS bullseye [16:04:15] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus 
netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11800391 (jcrespo) Let me know when you can. [16:07:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1087.eqiad.wmnet with OS bullseye [16:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:21] (PS49) CDobbins: (traffic): add alert for depooled cp* hosts [alerts] - https://gerrit.wikimedia.org/r/1217262 [16:12:30] SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org, Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11800430 (Andrew) The only thing left to do here (that I know if) is relative links being messed up in the initial wikitech-static landing pag... [16:14:53] SRE, Infrastructure-Foundations, netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11800433 (cmooney) Ok Juniper came back with the following: ` I found that your version 23.4R2-S7.4 is hitting the PR1933049. Unfortunately, this is a confidential PR, but in order to get thi... 
[16:15:17] (CR) Volans: sre.network: add workaround for dry-run in run_junos_commands (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1269011 (owner: Elukey) [16:19:14] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw - 3.2 upgrade (T421402) [16:19:17] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [16:20:27] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw - 3.2 upgrade (T421402) [16:20:44] (PS6) Eevans: aqs1024: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) [16:20:44] (PS7) Eevans: aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [16:20:44] (PS7) Eevans: aqs1026: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [16:20:45] (PS7) Eevans: aqs1027: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [16:22:23] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) (owner: Eevans) [16:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:45] (PS7) Eevans: aqs1024: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) [16:38:46] (PS8) Eevans: aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 
(https://phabricator.wikimedia.org/T412830) [16:38:46] (03PS8) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [16:38:46] (03PS8) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [16:39:03] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [16:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:42:36] (03PS2) 10Majavah: dumps: web: Remove plaintext HTTP server [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) [16:42:36] (03PS1) 10Majavah: dumps: web: Use 429 for connection limit issues [puppet] - 10https://gerrit.wikimedia.org/r/1269021 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1700) [17:00:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1088.eqiad.wmnet with OS bullseye [17:01:11] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1088 [17:02:53] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:06:44] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1088 - bking@cumin2002" [17:06:50] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1088 - 
bking@cumin2002" [17:06:50] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:06:51] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1088.eqiad.wmnet 176.32.64.10.in-addr.arpa 6.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:06:54] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1088.eqiad.wmnet 176.32.64.10.in-addr.arpa 6.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:06:55] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1088 [17:07:31] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1088 [17:07:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1088 [17:08:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2002.codfw.wmnet with OS trixie [17:17:12] (03PS1) 10Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 [17:17:23] (03CR) 10CI reject: [V:04-1] rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler) [17:19:04] (03PS2) 10Daniel Kinzler: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) [17:19:06] (03CR) 10Daniel Kinzler: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [17:19:20] (03PS3) 10Daniel Kinzler: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) [17:23:30] (03PS2) 10Daniel Kinzler: rest gateway: avoid 
re-defining routes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1269024 [17:23:38] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1088.eqiad.wmnet with reason: host reimage [17:27:30] ops-eqiad, SRE, DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11801010 (Papaul) @jcrespo we can do this next week Wednesday April 15th at 10am CT . Thank you. [17:29:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1088.eqiad.wmnet with reason: host reimage [17:35:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1089.eqiad.wmnet with OS bullseye [17:36:09] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1089.eqiad.wmnet with OS bullseye [17:36:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [17:37:15] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [17:38:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:39:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:40:11] ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11801044 (Jhancock.wm) a:Jhancock.wm good news everybody! there was definitely a power surge on this server. I've been replacing piec... [17:41:09] ops-codfw, SRE, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420708#11801062 (Jhancock.wm) Open→Resolved a:Jhancock.wm finally replaced all the parts that got fried in a power surge. powered up and back in the rack. 
[17:42:40] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1103 - bking@cumin2002" [17:42:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1103 - bking@cumin2002" [17:42:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:46] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1103.eqiad.wmnet 43.48.64.10.in-addr.arpa 3.4.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:42:50] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1103.eqiad.wmnet 43.48.64.10.in-addr.arpa 3.4.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:42:51] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1103 [17:43:41] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1103 [17:43:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [17:45:20] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11801099 (10jcrespo) 05Open→03Resolved @Jhancock.wm I want to thank you deeply the work, a lot! Please note your work will pay off,... 
[17:49:01] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:49:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1088.eqiad.wmnet with OS bullseye [18:00:05] dancy and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T1800). [18:00:10] o/ [18:00:16] I'm here to press buttons [18:00:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [18:01:42] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269032 (https://phabricator.wikimedia.org/T420481) [18:01:44] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269032 (https://phabricator.wikimedia.org/T420481) (owner: 10TrainBranchBot) [18:02:45] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269032 (https://phabricator.wikimedia.org/T420481) (owner: 10TrainBranchBot) [18:04:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [18:08:26] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.23 refs T420481 [18:08:30] T420481: 1.46.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T420481 [18:16:27] (03CR) 10Eevans: [C:03+2] aqs1024: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:24:03] (03PS1) 10Jdrewniak: Disable extension:WP25EasterEggs from Wikipedias. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) [18:25:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1103.eqiad.wmnet with OS bullseye [18:32:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [18:33:15] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [18:33:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [18:33:28] (03PS1) 10Jforrester: mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) [18:45:27] (03PS2) 10Jforrester: mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) [18:46:42] (03CR) 10RLazarus: [C:03+1] mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) (owner: 10Jforrester) [18:48:38] dancy: Is it possible for me to sneak out an MW chart fix now? [18:49:38] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [18:49:53] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 3714.54 ms [18:50:25] FIRING: [3x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:27] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:50:42] James_F: Yep! [18:50:47] Thanks. 
[18:50:52] (03CR) 10Jforrester: [C:03+2] mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) (owner: 10Jforrester) [18:52:56] (03Merged) 10jenkins-bot: mw-mcrouter: add /{dc}/wf-wan routes for Wikifunctions client cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269038 (https://phabricator.wikimedia.org/T422299) (owner: 10Jforrester) [18:53:47] (03CR) 10Stoyofuku-wmf: [C:03+1] "😢 end of an era" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) (owner: 10Jdrewniak) [18:54:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [18:55:05] (03PS1) 10Andrew Bogott: Add key for my new (and less destroyed) yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1269042 [18:55:25] FIRING: [4x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:54] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [18:56:15] (03CR) 10Andrew Bogott: [C:03+2] Add key for my new (and less destroyed) yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1269042 (owner: 10Andrew Bogott) [18:56:49] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [18:57:25] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [19:01:32] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1024.eqiad.wmnet with reason: Bootstrapping — T412830 [19:01:35] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [19:02:15] FIRING: [2x] 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:02:34] (03PS2) 10Dzahn: zuul::base: ensure /var/ssh/zuul exists [puppet] - 10https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) [19:02:55] ^ looking, I think this is just due to James_F's rollout in progress (i.e. due to the rollout itself, not a problem with the new config) but double-checking [19:04:10] yep, all looks fine except for the churn, that alert will clear on its own when the deployment finishes [19:09:59] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [19:10:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:11:13] ^ also expected [19:11:36] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Phabricator: Replace Exim on phabricator servers with Postfix - https://phabricator.wikimedia.org/T378029#11801471 (10A_smart_kitten) [19:12:39] (03PS3) 10Dzahn: zuul::base: ensure /var/ssh/zuul exists [puppet] - 10https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) [19:13:27] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 10Phabricator: Replace Exim on phabricator servers with Postfix - https://phabricator.wikimedia.org/T378029#11801490 (10Dzahn) [19:14:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1103.eqiad.wmnet with OS bullseye [19:15:25] FIRING: [4x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:15:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:16:26] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282#11801494 (10Andrew) 05Open→03Resolved [19:17:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:17:51] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269046 (https://phabricator.wikimedia.org/T128546) [19:20:36] (03PS4) 10Dzahn: zuul::base: ensure /var/ssh/zuul exists [puppet] - 10https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) [19:22:37] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [19:23:53] (03CR) 10Bearloga: [C:03+1] EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno) [19:26:03] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1260847/8397/zuul1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:27:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:28:42] ^ still expected, same story [19:29:14] (03CR) 10Dzahn: [C:03+1] gitlab: add feature flag for rsyslog input and disable in devtools [puppet] - 10https://gerrit.wikimedia.org/r/1268946 (https://phabricator.wikimedia.org/T422589) (owner: 10Jelto) [19:35:13] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [19:35:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:36:39] I wonder why that keeps firing right *after* the release finishes, something funny about the timing [19:36:49] it'll self-resolve again though [19:36:55] (03CR) 10Andrew Bogott: [C:03+2] cloudlb haproxy: allow configuring health port for tcp services [puppet] - 10https://gerrit.wikimedia.org/r/1260135 (owner: 10Andrew Bogott) [19:40:33] (03PS1) 10Ladsgroup: Use envoy for swift inside mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) [19:40:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:42:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2000). nyaa~ [20:00:05] toyofuku and toyofuku: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:41] (03CR) 10CDanis: [C:03+1] Use envoy for swift inside mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [20:01:02] oh wow looks like it's just you and me toyofuku [20:01:08] haha perfect [20:01:18] We can do rock paper scissors for who deploys the config patches? [20:02:30] scissors [20:03:09] rock (:< [20:03:31] :o [20:03:47] 😂 [20:03:56] 🪨 [20:06:11] toyofuku: lol, ok you win. Both of these can be done at the same time btw, I can verify the portal one and then I'll run a purge command after it's synced [20:06:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) (owner: 10Jdrewniak) [20:06:27] oh oops I just started the one config one [20:06:32] I can do the other one after if you'd like [20:06:48] I'm also in eng enclave ftr [20:07:01] (03PS1) 10Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - 10https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) [20:07:06] np! I [20:07:07] (03Merged) 10jenkins-bot: Disable extension:WP25EasterEggs from Wikipedias. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269036 (https://phabricator.wikimedia.org/T422548) (owner: 10Jdrewniak) [20:07:33] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1269036|Disable extension:WP25EasterEggs from Wikipedias. (T422548)]] [20:07:34] (03CR) 10CI reject: [V:04-1] zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - 10https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:07:36] T422548: Deployment: Disable the config flag for extension:WP25EasterEggs - https://phabricator.wikimedia.org/T422548 [20:09:26] !log toyofuku@deploy1003 jdrewniak, toyofuku: Backport for [[gerrit:1269036|Disable extension:WP25EasterEggs from Wikipedias. (T422548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:10:01] when you're done, please ping me. I have a fun patch to push [20:10:27] Verifying on testservers [20:10:37] bye bye babyglobe ): [20:12:03] yup. Looks like enwiki disabled it already, but on mwdebug the config page is now gone too! (as expected) [20:12:03] https://en.wikipedia.org/wiki/Special:CommunityConfiguration/WP25EasterEggs [20:12:25] yeah I was about to say [20:12:31] I'm being gaslit by community config [20:12:51] luckily I speak other languages [20:13:08] !log toyofuku@deploy1003 jdrewniak, toyofuku: Continuing with sync [20:13:17] looks good, moving on [20:15:31] (03PS2) 10Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - 10https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) [20:17:00] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269036|Disable extension:WP25EasterEggs from Wikipedias. 
(T422548)]] (duration: 09m 27s) [20:17:03] T422548: Deployment: Disable the config flag for extension:WP25EasterEggs - https://phabricator.wikimedia.org/T422548 [20:18:35] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1269053/8398/zuul1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:18:53] 10SRE-swift-storage, 06Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11801663 (10Jeff_G) Another file seems damaged, https://commons.wikimedia.org/wiki/File:Ciclo_de_vida_de_Daphnia_magna_(pulga_de_agua)-es.svg (both versions show "Thumbnail for version as of". For the origina... [20:19:18] toyofuku: looks like that's done. Going to start the portal deploy now [20:19:36] Thank you! Sorry switching gears to focus on eng enclave [20:19:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269046 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:20:44] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269046 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:21:01] toyofuku: thanks for deploying! and good game of rock paper scissors, I'll get you next time :P [20:21:11] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] [20:21:14] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:23:09] !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:24:22] (03PS1) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [20:24:23] !log jdrewniak@deploy1003 jdrewniak: Continuing with sync [20:25:01] (03CR) 10CI reject: [V:04-1] smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: 10Cwhite) [20:26:21] oy! [20:26:21] ``` [20:26:21] 20:24:23 Started sync-canaries-k8s [20:26:21] 20:24:26 K8s deployment progress: 0% (ok: 0; fail: 0; left: 60) [20:26:21] 20:24:32 Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1. [20:26:21] 20:24:32 Stdout/stderr follows: [20:26:21] 20:24:32 skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-api-ext/codfw.yaml" [20:26:22] ``` [20:27:39] oh no, deployment error! [20:28:22] https://www.irccloud.com/pastebin/e6QkR3Gr/ [20:28:41] https://spiderpig.wikimedia.org/jobs/1719 for those with access [20:30:06] seems like it rolled back fine, but no idea what that error was about. [20:30:25] FIRING: [5x] ProbeDown: Service aqs1024-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:27] (03PS2) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [20:30:41] swfrench-wmf: Are you around? [20:30:55] that "skipping missing values file" line is actually fine, your real error is from deeper in [20:30:56] Error: no cached repo found. (try 'helm repo update'): error loading /var/cache/helm/repository/wmf-stable-index.yaml: empty index.yaml file [20:31:00] and that *is* weird [20:31:03] Ah, rzl to the rescue! 
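The error quoted above is helm refusing to load a cached repo index that was empty when it was read. A minimal sketch of a pre-flight guard, assuming the file is only briefly empty while something else rewrites it (that cause is an assumption here, not something the error message itself proves). The path is the one from the error message; `index_is_usable` and `wait_for_index` are hypothetical helper names and are not part of scap, helm, or helmfile.

```python
import time
from pathlib import Path

# Path copied from the error message; everything else here is illustrative.
INDEX_PATH = Path("/var/cache/helm/repository/wmf-stable-index.yaml")


def index_is_usable(path: Path) -> bool:
    """Return True if the file exists and has non-whitespace content."""
    try:
        return bool(path.read_text(encoding="utf-8").strip())
    except FileNotFoundError:
        return False


def wait_for_index(path: Path, attempts: int = 5, delay_s: float = 2.0) -> bool:
    """Poll until the index looks usable, or give up after `attempts` checks."""
    for attempt in range(attempts):
        if index_is_usable(path):
            return True
        # Sleep only between checks, not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    if wait_for_index(INDEX_PATH):
        print("index present and non-empty; safe to proceed")
    else:
        print("index still missing or empty; not proceeding")
```

This only checks that the file is non-empty, not that it is valid YAML, so it narrows the window rather than closing it.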
[20:31:04] (03CR) 10CI reject: [V:04-1] smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: 10Cwhite) [20:31:48] o/ [20:31:52] puppet race? [20:32:00] I was wondering that. [20:32:00] that's what I was betting on, yeah [20:32:00] (03PS3) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [20:32:08] This is probably a "just retry" situation. [20:32:15] especially if the rollback worked, I bet a rollforward does too [20:32:20] * swfrench-wmf nods [20:32:26] I'll double-check the timing [20:33:03] jan_drewniak: try it again, and this time believe -- with your whole heart -- that your patch is worthy [20:33:10] (03PS4) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [20:33:12] haha [20:33:23] (03CR) 10Zabe: Use envoy for swift inside mediawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [20:33:25] alright here goes! [20:34:00] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] [20:34:03] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:34:05] puppet run from 20:19:14 to 20:24:50 [20:34:09] so yeah, that tracks [20:34:15] ah yep [20:34:55] the window for this race is a lot shorter than that 5½ minutes obviously, but we could still probably be cleverer about this if we needed to [20:35:20] it does feel odd that this has now happened twice in the past month or so ... 
but I also don't want to draw connections between sparse data points [20:35:49] !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:35:49] oh has it? hm, yeah [20:36:24] !log jdrewniak@deploy1003 jdrewniak: Continuing with sync [20:36:34] I mean "twice" inclusive of this instance of it [20:36:44] nod [20:37:07] (03PS5) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [20:38:06] (03PS6) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [20:38:35] (03PS1) 10Andrew Bogott: nova vendordata: disable unattended upgrades in base image [puppet] - 10https://gerrit.wikimedia.org/r/1269056 (https://phabricator.wikimedia.org/T422509) [20:40:14] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269046|Bumping portals to master (T128546)]] (duration: 06m 14s) [20:40:17] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:40:20] (03CR) 10Ladsgroup: Use envoy for swift inside mediawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [20:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:44:01] (03CR) 10Bartosz Dziewoński: [C:03+1] Remove unused JWT for bot password temporary config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01) [20:44:07] rzl/swfrench-wmf: thanks for the eyeballs [20:45:15] * swfrench-wmf thumbs up [20:51:49] (03CR) 10Zabe: Use envoy for swift inside mediawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [20:54:43] jouncebot: nowandnext [20:54:43] For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2000) [20:54:43] In 0 hour(s) and 5 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2100) [20:55:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:56:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [20:57:50] (03Merged) 10jenkins-bot: Use envoy for swift inside mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269050 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [20:58:12] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] [20:58:15] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [20:58:25] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:56] (03CR) 10Majavah: [C:04-1] "https://phabricator.wikimedia.org/T422509#11801856" [puppet] - 10https://gerrit.wikimedia.org/r/1269056 (https://phabricator.wikimedia.org/T422509) (owner: 10Andrew Bogott) [21:00:04] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2100) [21:00:45] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [21:00:55] I'll be done really quickly [21:04:39] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] (duration: 06m 27s) [21:04:42] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [21:09:59] (03CR) 10RLazarus: [C:03+2] function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [21:12:13] (03Merged) 10jenkins-bot: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [21:17:25] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:19:05] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [21:20:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:20:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 499547320 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:21:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 616 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:24:36] 10ops-eqiad, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T422748 (10phaultfinder) 03NEW [21:26:04] (03PS1) 10Bernard Wang: Enable reading list beta feature for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 [21:26:54] (03CR) 10CI reject: [V:04-1] Enable reading list beta feature for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 (owner: 10Bernard Wang) [21:27:42] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:36:23] (03PS1) 10RLazarus: Revert "function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269064 (https://phabricator.wikimedia.org/T367880) [21:39:40] (03CR) 10RLazarus: [C:03+2] Revert "function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269064 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [21:41:56] 
(03Merged) 10jenkins-bot: Revert "function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269064 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [21:45:28] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:45:40] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:46:00] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:47:23] (03PS2) 10Cwhite: add beta-logs pki key [labs/private] - 10https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516) [21:47:43] (03PS3) 10Cwhite: initial pki config for beta-logs env [puppet] - 10https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516) [21:49:41] (03CR) 10CI reject: [V:04-1] initial pki config for beta-logs env [puppet] - 10https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [21:51:10] (03PS4) 10Cwhite: initial pki config for beta-logs env [puppet] - 10https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516) [21:55:48] (03PS1) 10Ladsgroup: Revert "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269067 [21:56:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269067 (owner: 10Ladsgroup) [21:57:09] (03Merged) 10jenkins-bot: Revert "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269067 (owner: 10Ladsgroup) [21:57:35] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269067|Revert "Use envoy for swift inside mediawiki"]] [21:57:46] (03CR) 10Aude: Enable reading list beta feature for pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 (owner: 10Bernard 
Wang) [21:59:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 (owner: 10Bernard Wang) [21:59:30] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269067|Revert "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260408T2200) [22:00:35] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [22:04:29] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269067|Revert "Use envoy for swift inside mediawiki"]] (duration: 06m 54s) [22:04:50] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T422748#11802135 (10phaultfinder) [22:05:25] FIRING: [3x] ProbeDown: Service aqs1024-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:10:56] (03PS1) 10RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) [22:15:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11802153 (10VRiley-WMF) @Jclark-ctr was it able to finish the provisioning? I attempted to do this with ganeti1055, but it wasn't able to finish. Oddly enough, the... 
[22:16:42] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11802156 (10MMigurski-WMF) I have updated my email to a wikimedia.org address, and I requested access to the `wmf` group. I believe that might be sufficient for my required access,... [22:19:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11802189 (10Jclark-ctr) >>! In T418903#11802153, @VRiley-WMF wrote: > @Jclark-ctr was it able to finish the provisioning? I attempted to do this with ganeti1055, but... [22:23:23] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 2.165e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [22:25:23] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 1 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [22:42:49] (03PS3) 10Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - 10https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) [22:45:25] FIRING: [2x] ProbeDown: Service aqs1024-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:00] (03PS4) 10Dzahn: zuul::base: use wmflib::mkdir_p to ensure directories [puppet] - 10https://gerrit.wikimedia.org/r/1269053 
(https://phabricator.wikimedia.org/T395938) [23:00:44] (03CR) 10Dzahn: [V:03+1 C:03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1269053" [puppet] - 10https://gerrit.wikimedia.org/r/1260847 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:01:09] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1269053/8400/" [puppet] - 10https://gerrit.wikimedia.org/r/1269053 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:17:41] (03PS1) 10Jforrester: wikifunctions: Stop testing the v1 orchestrator endpoint, we're dropping it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269072 (https://phabricator.wikimedia.org/T421768) [23:22:45] (03PS1) 10Dzahn: zuul::executor: add TLS full chain needed for zookeeper config [puppet] - 10https://gerrit.wikimedia.org/r/1269073 (https://phabricator.wikimedia.org/T421398) [23:25:41] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1269073/8401/" [puppet] - 10https://gerrit.wikimedia.org/r/1269073 (https://phabricator.wikimedia.org/T421398) (owner: 10Dzahn) [23:28:21] (03CR) 10Jforrester: [C:03+1] function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [23:39:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:39:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1269077 [23:39:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1269077 (owner: 10TrainBranchBot) [23:49:56] (03Merged) 10jenkins-bot: Branch 
commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1269077 (owner: 10TrainBranchBot)