[00:00:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:00:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:01:52] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:02:52] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:03:20] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS trixie
[00:03:26] (CR) Jasmine: [C:+2] kafka-main2009: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288920 (https://phabricator.wikimedia.org/T427088) (owner: Jasmine)
[00:03:50] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host kafka-main2009
[00:05:52] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:05:58] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:06:52] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:06:54] jasmine@cumin2002 reimage (PID 1563623) is awaiting input
[00:06:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:06:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:06:58] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:07:56] (PS1) Jasmine: hieradata/common.yaml: add new IPs for kafka-main2009, following vlan migrations [puppet] - https://gerrit.wikimedia.org/r/1299656 (https://phabricator.wikimedia.org/T427088)
[00:09:32] (CR) Jasmine: [C:+2] hieradata/common.yaml: add new IPs for kafka-main2009, following vlan migrations [puppet] - https://gerrit.wikimedia.org/r/1299656 (https://phabricator.wikimedia.org/T427088) (owner: Jasmine)
[00:10:49] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox
[00:15:11] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main2009 - jasmine@cumin2002"
[00:15:16] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main2009 - jasmine@cumin2002"
[00:15:17] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:15:17] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache kafka-main2009.codfw.wmnet 33.48.192.10.in-addr.arpa 3.3.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[00:15:21] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-main2009.codfw.wmnet 33.48.192.10.in-addr.arpa 3.3.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[00:15:22] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-main2009
[00:15:40] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main2009
[00:15:40] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-main2009
[00:24:19] SRE-swift-storage, MediaWiki-Uploading: "Could not read file" error during upload - https://phabricator.wikimedia.org/T428315#12002913 (hinnk) Come on now @MatthewVernon, please don't post my IP address to publicly visible pages. That's not cool. The script I'm referring to is the one at [[https://commo...
[00:27:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[00:32:26] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[00:32:48] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[00:32:56] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[00:33:38] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[00:33:46] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[00:34:14] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[00:34:22] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[00:35:11] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[00:35:19] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[00:35:47] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[00:35:55] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[00:36:45] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[00:36:52] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[00:37:41] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[00:37:50] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[00:38:17] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[00:38:24] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[00:38:46] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage
[00:39:12] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[00:39:19] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[00:39:48] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[00:39:54] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[00:40:44] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[00:43:21] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage
[00:52:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[00:52:27] \o/
[00:52:33] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:53] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[00:54:14] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[01:00:31] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2009.codfw.wmnet with OS trixie
[01:08:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 855.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:09:57] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1299660
[01:09:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1299660 (owner: TrainBranchBot)
[01:10:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 12.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:11:29] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main1008.eqiad.wmnet with OS trixie
[01:12:00] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host kafka-main1008
[01:15:03] jasmine@cumin2002 reimage (PID 1578727) is awaiting input
[01:15:05] (PS1) Jasmine: hieradata/common.yaml: add new IPs for kafka-main1008, following vlan migrations [puppet] - https://gerrit.wikimedia.org/r/1299661 (https://phabricator.wikimedia.org/T427088)
[01:16:37] (CR) Jasmine: [C:+2] hieradata/common.yaml: add new IPs for kafka-main1008, following vlan migrations [puppet] - https://gerrit.wikimedia.org/r/1299661 (https://phabricator.wikimedia.org/T427088) (owner: Jasmine)
[01:17:42] (CR) Jasmine: [C:+2] kafka-main1008: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1285476 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[01:18:55] (PS9) Pppery: Add locales for all remaining languages [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221224 (https://phabricator.wikimedia.org/T412651)
[01:19:17] (CR) Pppery: "This should now be ready to merge. I might need to add some more locale files later, though." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221224 (https://phabricator.wikimedia.org/T412651) (owner: Pppery)
[01:19:39] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox
[01:21:05] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1299660 (owner: TrainBranchBot)
[01:23:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 801.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:23:49] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main1008 - jasmine@cumin2002"
[01:23:55] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main1008 - jasmine@cumin2002"
[01:23:55] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:23:56] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache kafka-main1008.eqiad.wmnet 45.32.64.10.in-addr.arpa 5.4.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[01:23:59] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-main1008.eqiad.wmnet 45.32.64.10.in-addr.arpa 5.4.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[01:24:00] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-main1008
[01:24:47] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main1008
[01:24:47] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-main1008
[01:36:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[01:43:05] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1008.eqiad.wmnet with reason: host reimage
[01:43:53] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[01:44:14] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[01:44:23] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[01:45:05] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[01:45:13] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[01:45:40] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[01:45:48] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[01:46:35] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[01:46:43] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[01:47:09] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[01:47:17] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[01:48:06] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[01:48:13] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[01:49:04] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[01:49:12] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[01:49:40] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[01:49:47] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[01:49:47] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1008.eqiad.wmnet with reason: host reimage
[01:50:35] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[01:50:42] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[01:51:11] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[01:51:18] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[01:52:08] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[01:53:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[01:53:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ...
[01:53:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[01:55:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:56:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 543348648 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:00:51] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:01:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[02:02:45] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[02:03:06] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[02:03:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 165448 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:05:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:07:33] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:07:33] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:33] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1008.eqiad.wmnet with OS trixie
[02:07:38] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 47s)
[02:11:51] ops-codfw, SRE, DC-Ops, observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12003013 (Papaul) @colewhite @herron hello Any update on this? Thanks
[02:13:56] SRE-swift-storage, MediaWiki-Uploading: "Could not read file" error during upload - https://phabricator.wikimedia.org/T428315#12003015 (hinnk) I retried the upload and it worked this time. The previous failures happened at least like four different times over three different days, so I have no idea what...
[02:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:34:38] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:49:38] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:54:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:59:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:57:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:59:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:05:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:30:00] (PS1) Raymond Ndibe: replica_cnf_api_service: check for missing kubeconfig before _create_envvar [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207)
[04:31:52] (PS2) Raymond Ndibe: replica_cnf_api_service: check for missing kubeconfig before _create_envvar [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207)
[04:33:54] (CR) CI reject: [V:-1] replica_cnf_api_service: check for missing kubeconfig before _create_envvar [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207) (owner: Raymond Ndibe)
[04:35:01] (CR) Raymond Ndibe: replica_cnf_api_service: check for missing kubeconfig before _create_envvar (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207) (owner: Raymond Ndibe)
[04:38:04] (PS3) Raymond Ndibe: replica_cnf_api_service: check for missing kubeconfig before _create_envvar [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207)
[04:40:15] (CR) CI reject: [V:-1] replica_cnf_api_service: check for missing kubeconfig before _create_envvar [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207) (owner: Raymond Ndibe)
[04:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:45:28] (PS4) Raymond Ndibe: replica_cnf_api_service: check for missing kubeconfig before _create_envvar [puppet] - https://gerrit.wikimedia.org/r/1299671 (https://phabricator.wikimedia.org/T424207)
[04:46:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:52:33] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:47] (PS1) Abijeet Patro: Enable ULS v2 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1299676
[05:18:10] (PS2) Abijeet Patro: Enable ULS v2 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1299676
[05:18:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1299676 (owner: Abijeet Patro)
[05:41:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[05:42:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[05:43:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[05:43:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[05:56:20] (PS1) Giuseppe Lavagetto: haproxy: get ipblock map directly from HP [puppet] - https://gerrit.wikimedia.org/r/1299939 (https://phabricator.wikimedia.org/T422249)
[05:56:23] (PS1) Giuseppe Lavagetto: haproxy: use ipblocks map created by hiddenparma [puppet] - https://gerrit.wikimedia.org/r/1299940 (https://phabricator.wikimedia.org/T422249)
[05:56:25] (PS1) Giuseppe Lavagetto: haproxy: remove absented resource [puppet] - https://gerrit.wikimedia.org/r/1299941
[05:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.03% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T0600)
[06:01:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:17:14] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[06:18:02] (CR) Ayounsi: "Nice! I was wondering how much Juniper and nokia diverge for that OpenConfig path?" [puppet] - https://gerrit.wikimedia.org/r/1299634 (https://phabricator.wikimedia.org/T428685) (owner: Cathal Mooney)
[06:21:53] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:29:48] (CR) DCausse: [C:+1] translate: adding separate read/write endpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/1299529 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[06:32:07] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 229.44 ms
[06:36:48] (CR) Hashar: [C:+1] contint: switch apache proxying to jenkins to use https [puppet] - https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[06:41:41] (CR) Brouberol: "I misunderstood what was asked in the ticket, sorry. In that case, this patch isn't the way to go." [deployment-charts] - https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428099) (owner: Brouberol)
[06:41:46] (Abandoned) Brouberol: airflow: export the CLASSPATH environment variable into the task-pod shell [deployment-charts] - https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428099) (owner: Brouberol)
[06:43:23] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 4287.93 ms
[06:44:35] (CR) Arnaudb: Change update to exactly match the given image name (2 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1166856 (owner: Hashar)
[06:45:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[06:45:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[06:46:11] (CR) Arnaudb: [C:+1] "lgtm" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1178068 (https://phabricator.wikimedia.org/T401733) (owner: Hashar)
[06:52:41] (PS1) Sadiya.mohammed13: Add instance-of WikiProject links for paintings and elections [mediawiki-config] - https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936)
[06:53:37] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 248.25 ms
[06:53:45] (CR) Nikerabbit: [C:+1] Enable ULS v2 on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1299676 (owner: Abijeet Patro)
[06:55:47] (PS1) Muehlenhoff: Record LDAP access for dmiranda [puppet] - https://gerrit.wikimedia.org/r/1299944
[06:56:48] (CR) Arnaudb: [C:+1] gerrit: flip direction of symlink for log directories [puppet] - https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: Dzahn)
[06:58:28] (CR) Muehlenhoff: [C:+2] Record LDAP access for dmiranda [puppet] - https://gerrit.wikimedia.org/r/1299944 (owner: Muehlenhoff)
[07:00:01] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:00:05] Amir1, urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T0700). nyaa~
[07:00:05] atsukoito: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:03:54] i'm here!
[07:03:57] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8686/co" [puppet] - https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: Filippo Giunchedi)
[07:04:18] dcausse: hi!
[07:04:24] o/
[07:04:28] atsukoito: hey!
[07:05:03] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 225.92 ms [07:05:43] * atsukoito logging in to spiderpig [07:06:41] atsukoito: I think you could ship all three patches at once, what do you think? [07:08:00] I think so too but I don't know which branch it will merge it to.. [07:08:37] i will try to put all three in the request [07:09:12] they'll land on the branch that is specified in the gerrit patch [07:09:32] (03PS2) 10Filippo Giunchedi: zookeeper: fail on empty myid [puppet] - 10https://gerrit.wikimedia.org/r/1287893 (https://phabricator.wikimedia.org/T422646) [07:09:47] and mw-config has only one master branch (same for all wiki versions) [07:11:01] lets try then! the train is on group 0 now, so the small wikis already has the patch [07:11:15] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:11:20] yes [07:11:36] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:12:01] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:12:08] !log backporting extensions/Translate to wmf/1.47.0-wmf.5 and applying the config [07:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299556 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [07:12:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299561 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [07:12:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299529 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:13:12] (03CR) 
10Huei Tan: [C:03+1] Enable ULS v2 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [07:13:30] (03Merged) 10jenkins-bot: translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299529 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:14:08] (03Merged) 10jenkins-bot: ElasticSearchTtmServer: drop include_type_name and support int replicas [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299556 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [07:14:10] (03Merged) 10jenkins-bot: ElasticSearchTtmServer: clean stale _doc usage and version error output [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299561 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [07:14:48] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:15:30] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1299556|ElasticSearchTtmServer: drop include_type_name and support int replicas (T428168)]], [[gerrit:1299561|ElasticSearchTtmServer: clean stale _doc usage and version error output (T428168)]], [[gerrit:1299529|translate: adding separate read/write endpoints (T425377)]] [07:15:37] T428168: Make Translate compatible with OpenSearch 2 - https://phabricator.wikimedia.org/T428168 [07:15:37] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:16:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:17:44] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1299556|ElasticSearchTtmServer: drop include_type_name and support int replicas (T428168)]], [[gerrit:1299561|ElasticSearchTtmServer: clean stale _doc usage and version error output (T428168)]], [[gerrit:1299529|translate: adding separate read/write 
endpoints (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:18:01] dcausse: checking config [07:18:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1287893 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [07:18:20] ack, testing few special pages [07:19:04] new servers are present, populating the index [07:20:28] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:20:31] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [07:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:21:17] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:21:19] (03CR) 10JMeybohm: "Given it's in the v3 module, should it also set `ETCDCTL_API=3`?" [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [07:21:27] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:21:31] tested Special:SearchTranslations & Special:Translate, all good [07:21:40] test indices in both eqiad and codfw are populated [07:22:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1215.eqiad.wmnet with reason: Reimage [07:23:05] logs remained quiet on mw-debug servers I believe we could proceed? 
[07:23:06] ttmserver-test indices https://www.irccloud.com/pastebin/vam9J3HG/ [07:23:12] looking [07:23:38] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:23:46] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db1215.eqiad.wmnet with OS trixie [07:23:55] (03CR) 10Filippo Giunchedi: [C:03+2] zookeeper: fail on empty myid [puppet] - 10https://gerrit.wikimedia.org/r/1287893 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [07:24:51] atsukoito: sounds good to me [07:24:56] (03CR) 10JMeybohm: etcd: make etcdctl work out of the box (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [07:25:09] !log atsuko@deploy1003 atsuko: Continuing with deployment [07:25:29] (03CR) 10Elukey: redfish: improve add_account with AccountTypes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [07:25:35] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: orchestrator show-resolve-hosts failed: 2026-06-10 07:25:31 ERROR dial tcp 10.64.16.90:3306: i/o timeout https://wikitech.wikimedia.org/wiki/Orchestrator [07:26:32] atsukoito: quick note, after shipping, translation data will start to get populated for prod wikis (meta & wiki) into the ttmserver index via the mw job queue [07:26:59] s/meta & wiki/meta & wikidata/ [07:28:00] dcausse: can i invite you to the today's standup to explain what is translate memory on the wiki and who sees it? 
[07:28:15] it is in 1h 30m [07:28:17] all that to say that we'll have to monitor a bit the logs after shipping because mw jobrunners are not part of the scap "test phase" [07:28:21] atsukoito: sure [07:29:34] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299556|ElasticSearchTtmServer: drop include_type_name and support int replicas (T428168)]], [[gerrit:1299561|ElasticSearchTtmServer: clean stale _doc usage and version error output (T428168)]], [[gerrit:1299529|translate: adding separate read/write endpoints (T425377)]] (duration: 14m 03s) [07:29:40] T428168: Make Translate compatible with OpenSearch 2 - https://phabricator.wikimedia.org/T428168 [07:29:41] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:30:32] dcausse: I see prod indices has started to get data, too [07:30:39] (03PS2) 10Filippo Giunchedi: etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) [07:30:39] (03PS3) 10Filippo Giunchedi: icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) [07:30:39] (03PS2) 10Filippo Giunchedi: toolforge: remove checker access from k8s::etcd [puppet] - 10https://gerrit.wikimedia.org/r/1299546 (https://phabricator.wikimedia.org/T313030) [07:30:39] (03PS2) 10Filippo Giunchedi: Remove toolschecker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/1299547 (https://phabricator.wikimedia.org/T313030) [07:30:40] (03PS1) 10Filippo Giunchedi: fixup! etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299946 [07:30:55] nice! 
[07:31:42] (03CR) 10Filippo Giunchedi: "indeed, still needed until etcd 3.4 😭" [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [07:33:43] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:34:09] (03CR) 10Filippo Giunchedi: etcd: make etcdctl work out of the box (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [07:35:55] (03PS3) 10Filippo Giunchedi: etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) [07:35:55] (03PS4) 10Filippo Giunchedi: icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) [07:35:55] (03PS3) 10Filippo Giunchedi: toolforge: remove checker access from k8s::etcd [puppet] - 10https://gerrit.wikimedia.org/r/1299546 (https://phabricator.wikimedia.org/T313030) [07:35:55] pardon the gerrit spam [07:35:55] (03PS3) 10Filippo Giunchedi: Remove toolschecker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/1299547 (https://phabricator.wikimedia.org/T313030) [07:36:20] (03Abandoned) 10Filippo Giunchedi: fixup! etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299946 (owner: 10Filippo Giunchedi) [07:37:44] !log javiermonton@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [07:38:11] !log javiermonton@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [07:38:32] (03CR) 10Ayounsi: "One comment otherwise lgtm!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) (owner: 10Cathal Mooney) [07:39:17] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage [07:40:33] !log installing openssl security updates [07:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:41] !log javiermonton@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [07:41:02] !log javiermonton@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [07:44:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage [07:48:30] !log javiermonton@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [07:48:49] !log javiermonton@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [07:49:12] dcausse: i don't see any errors [07:49:27] * atsukoito will go back to #wikimedia-search [07:50:00] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:50:57] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:51:40] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:52:11] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:52:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:52:43] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:53:38] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:55:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/services/wdqs: apply [07:56:17] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:57:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:57:27] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:57:48] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:59:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:00:05] dduvall and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T0800). 
nyaa~ [08:01:21] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1215.eqiad.wmnet with OS trixie [08:04:21] fceratto@cumin1003 major-upgrade (PID 2619557) is awaiting input [08:04:56] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:05:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:06:08] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:06:45] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [08:07:51] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [08:08:35] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:08:46] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:08:47] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:09:08] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:09:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:11:48] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:18:18] (03PS1) 10Jcrespo: dbbackups: Reenable regular es backups and update RO job ids [puppet] - 10https://gerrit.wikimedia.org/r/1300041 (https://phabricator.wikimedia.org/T427357) [08:19:13] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300041 (https://phabricator.wikimedia.org/T427357) (owner: 10Jcrespo) [08:20:34] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:22:05] (03PS2) 10Jcrespo: dbbackups: Reenable regular es backups and update RO job ids [puppet] - 10https://gerrit.wikimedia.org/r/1300041 
(https://phabricator.wikimedia.org/T427357) [08:22:44] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300041 (https://phabricator.wikimedia.org/T427357) (owner: 10Jcrespo) [08:23:13] (03CR) 10Jcrespo: [C:04-2] "Not until last job finishes today" [puppet] - 10https://gerrit.wikimedia.org/r/1300041 (https://phabricator.wikimedia.org/T427357) (owner: 10Jcrespo) [08:26:31] (03PS2) 10Arnaudb: gitlab: add gitlab-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) [08:27:09] (03CR) 10Arnaudb: "no problem, let me know :-)" [dns] - 10https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:27:43] (03PS3) 10Arnaudb: gitlab: add gitlab-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) [08:29:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:29:18] (03CR) 10Jelto: [C:03+1] "lgtm, thank you" [dns] - 10https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:29:46] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:30:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:31:55] (03CR) 10Jaime Nuche: [C:03+1] releases: remove outdated comments about releases-jenkins in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1299585 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [08:37:26] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8687/co" [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [08:37:49] (03CR) 10Arnaudb: service: add gitlab-https and gitlab-ssh service to service catalog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:45:09] (03CR) 10Wangombe: [C:03+1] Enable ULS v2 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [08:46:34] 06SRE-OnFire, 10Cloud-VPS, 06cloud-services-team (FY2025/2026-Q3-Q4), 13Patch-For-Review, 07Sustainability (Incident Followup): Add external meta-monitoring for metricsinfra - https://phabricator.wikimedia.org/T288053#12003600 (10fgiunchedi) [08:52:33] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:54:11] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:54:27] (03PS4) 10JMeybohm: etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [08:54:31] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) 
(owner: 10Filippo Giunchedi) [08:55:47] (03CR) 10Hashar: Change update to exactly match the given image name (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 (owner: 10Hashar) [08:55:59] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:59:28] (03CR) 10Arnaudb: Change update to exactly match the given image name (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 (owner: 10Hashar) [09:00:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:00:40] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:02:03] jouncebot: nowandnext [09:02:03] For the next 0 hour(s) and 57 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T0800) [09:02:03] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1000) [09:02:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:02:35] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:03:07] (03PS1) 10Jelto: gerrit: update rsyslog configuration for apache logs [puppet] - 10https://gerrit.wikimedia.org/r/1300049 (https://phabricator.wikimedia.org/T425667) [09:03:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:03:18] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:03:57] (03CR) 10Arnaudb: [C:03+1] "thanks for the fix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1300049 (https://phabricator.wikimedia.org/T425667) (owner: 10Jelto) [09:04:54] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:05:02] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:05:16] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8688/co" [puppet] - 10https://gerrit.wikimedia.org/r/1300049 (https://phabricator.wikimedia.org/T425667) (owner: 10Jelto) [09:05:16] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:05:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:05:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:15:40] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:15:43] (03PS3) 10Cathal Mooney: Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) [09:17:46] (03CR) 10Cathal Mooney: Validators - add check to make sure dns_name is unique (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) (owner: 10Cathal Mooney) [09:17:50] (03CR) 10CI reject: [V:04-1] Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) (owner: 10Cathal Mooney) [09:18:31] (03PS4) 10Cathal Mooney: Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) [09:18:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus5003.eqsin.wmnet to plain [09:19:39] 
(03CR) 10Hashar: Change update to exactly match the given image name (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 (owner: 10Hashar) [09:20:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus5003.eqsin.wmnet to plain [09:20:43] PROBLEM - Host prometheus5003 is DOWN: CRITICAL - Host Unreachable (10.132.2.5) [09:20:45] (03CR) 10Ayounsi: [C:03+1] Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) (owner: 10Cathal Mooney) [09:21:17] RECOVERY - Host prometheus5003 is UP: PING OK - Packet loss = 0%, RTA = 234.21 ms [09:21:32] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: update rsyslog configuration for apache logs [puppet] - 10https://gerrit.wikimedia.org/r/1300049 (https://phabricator.wikimedia.org/T425667) (owner: 10Jelto) [09:22:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [09:22:59] (03PS2) 10Hashar: Change update to exactly match the given image name [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 [09:23:18] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [09:23:43] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [09:23:55] (03CR) 10Clément Goubert: [C:03+1] etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [09:24:00] (03CR) 10Cathal Mooney: [C:03+2] Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) (owner: 10Cathal Mooney) [09:26:09] !log upgrade routinator in eqiad to 0.15.2 T428456 [09:26:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:13] T428456: Upgrade Routinator to 0.15.2 - https://phabricator.wikimedia.org/T428456 [09:26:39] (03PS1) 10Kosta Harlan: SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300058 (https://phabricator.wikimedia.org/T425929) [09:26:51] jouncebot: nowandnext [09:26:51] For the next 0 hour(s) and 33 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T0800) [09:26:51] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1000) [09:29:23] (03PS1) 10Kosta Harlan: SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300059 (https://phabricator.wikimedia.org/T425929) [09:29:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300058 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [09:29:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300059 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [09:30:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12003754 (10BTullis) [09:30:23] (03Merged) 10jenkins-bot: Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) (owner: 10Cathal Mooney) [09:30:47] 
10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12003757 (10BTullis) [09:32:06] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [09:32:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [09:33:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#12003775 (10BTullis) a:05bking→03BTullis I'll take this task, along with T423312, since there is some work on partman requ... [09:34:24] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [09:34:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12003785 (10BTullis) a:05Jhancock.wm→03BTullis I'll take this task, along with T423314... 
[09:35:21] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:36:44] (03PS1) 10Brouberol: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 [09:37:07] (03PS2) 10Brouberol: admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 [09:37:15] (03PS1) 10Brouberol: admin_ng/dse-k8s: add comments explaining custom labels and deploy roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300050 [09:37:21] (03PS1) 10Brouberol: admin_ng/dse-k8s: alphabetically sort namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300051 [09:37:24] (03PS1) 10Brouberol: admin_ng/dse-k8s: values automatic reformatting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300052 [09:37:28] (03PS1) 10Brouberol: admin_ng/dse-k8s/ceph: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300053 [09:37:32] (03PS1) 10Brouberol: admin_ng/dse-k8s-eqiad/cloudnative-pg sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300054 [09:37:36] (03PS1) 10Brouberol: admin_ng/dse-k8s/opensearch: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300055 [09:37:41] (03PS1) 10Brouberol: admin_ng/dse-k8s-eqiad/flink: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300056 [09:37:45] (03PS1) 10Brouberol: admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300057 [09:43:04] (03Merged) 10jenkins-bot: SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300058 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [09:43:06] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s: add comments 
explaining custom labels and deploy roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300050 (owner: 10Brouberol) [09:43:08] (03CR) 10CI reject: [V:04-1] SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300059 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [09:43:31] (03PS1) 10Blake: ProductionServices: reboot poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300064 (https://phabricator.wikimedia.org/T426736) [09:43:40] (03PS1) 10Clément Goubert: rest-gateway: Cache liftwing-openapi-specs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300065 (https://phabricator.wikimedia.org/T427902) [09:43:53] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s: alphabetically sort namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300051 (owner: 10Brouberol) [09:45:08] (03CR) 10Clément Goubert: [C:03+1] ProductionServices: reboot poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300064 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [09:45:10] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#12003807 (10tappof) You can filter alerts of interest using `team=data-persistence`. The AlertLintProblem check groups together different linting issues, which are described in detail in... 
[09:45:17] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s: add comments explaining custom labels and deploy roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300050 (owner: 10Brouberol) [09:45:23] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s: alphabetically sort namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300051 (owner: 10Brouberol) [09:45:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300059 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [09:46:24] (03CR) 10Btullis: admin_ng/dse-k8s: values automatic reformatting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300052 (owner: 10Brouberol) [09:46:59] (03Merged) 10jenkins-bot: SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300059 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [09:47:28] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1300058|SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed (T425929)]], [[gerrit:1300059|SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed (T425929)]] [09:47:33] T425929: Cannot publish after dismissing hCaptcha challenge triggered by AbuseFilter on mobile source editor - https://phabricator.wikimedia.org/T425929 [09:48:20] (03CR) 10Atsuko: [C:03+1] admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300057 (owner: 10Brouberol) [09:48:49] (03CR) 10Atsuko: [C:03+1] admin_ng/dse-k8s/opensearch: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300055 (owner: 10Brouberol) [09:49:30] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1300058|SourceEditorOverlay: Show CAPTCHA panel when AF 
challenge closed (T425929)]], [[gerrit:1300059|SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed (T425929)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:49:58] (03CR) 10Atsuko: [C:03+1] admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 (owner: 10Brouberol) [09:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.49% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:50:28] (03CR) 10Brouberol: admin_ng/dse-k8s: values automatic reformatting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300052 (owner: 10Brouberol) [09:51:58] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1300065 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [09:52:11] (03CR) 10Arthur taylor: "This looks good - please rebase it against the latest version of the other change in this patch-chain (Ib1babeda984e523a77b659f1d1b8175162" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936) (owner: 10Sadiya.mohammed13) [09:52:34] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:53:58] (03Merged) 10jenkins-bot: admin_ng/dse-k8s: add comments explaining custom labels and deploy roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300050 (owner: 10Brouberol) [09:54:21] (03Merged) 10jenkins-bot: admin_ng/dse-k8s: alphabetically sort namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300051 (owner: 10Brouberol) [09:55:10] (03PS12) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [09:56:12] (03CR) 10JMeybohm: [C:03+1] etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [09:57:01] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300058|SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed (T425929)]], [[gerrit:1300059|SourceEditorOverlay: Show CAPTCHA panel when AF challenge closed (T425929)]] (duration: 09m 32s) [09:57:06] T425929: Cannot publish after dismissing hCaptcha challenge triggered by AbuseFilter on mobile source editor - https://phabricator.wikimedia.org/T425929 [09:58:17] (03CR) 10Cathal Mooney: "I'll respond on task" [puppet] - 10https://gerrit.wikimedia.org/r/1299634 (https://phabricator.wikimedia.org/T428685) (owner: 10Cathal Mooney) [09:59:55] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: AlertLintProblem (instance 
localhost:9123) - https://phabricator.wikimedia.org/T426809#12003884 (10tappof) 05Open→03Resolved [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1000) [10:00:13] (03CR) 10Ayounsi: [C:03+1] "Thanks, then let's more forward with this." [puppet] - 10https://gerrit.wikimedia.org/r/1299634 (https://phabricator.wikimedia.org/T428685) (owner: 10Cathal Mooney) [10:00:20] just a heads up - i'm hoping to use the infra window in a moment to reboot poolcounter servers, which will want backports [10:01:34] (03CR) 10Blake: [C:03+2] ProductionServices: reboot poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300064 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:03:52] (03Merged) 10jenkins-bot: ProductionServices: reboot poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300064 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:04:36] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1300064|ProductionServices: reboot poolcounter1006.eqiad (T426736)]] [10:05:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:06:17] (03Abandoned) 10Hashar: 
Remove the ssh module's unused init.pp [puppet] - 10https://gerrit.wikimedia.org/r/507095 (owner: 10Alex Monk) [10:06:51] (03PS1) 10Gkyziridis: wgRestSandboxSpecs: Add Lift Wing API to documentation wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300073 (https://phabricator.wikimedia.org/T427902) [10:06:55] !log blake@deploy1003 blake: Backport for [[gerrit:1300064|ProductionServices: reboot poolcounter1006.eqiad (T426736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:07:11] (03Abandoned) 10Hashar: Standardizes English dictionaries on hunspell for English in ORES [puppet] - 10https://gerrit.wikimedia.org/r/556023 (https://phabricator.wikimedia.org/T239942) (owner: 10Halfak) [10:07:34] !log blake@deploy1003 blake: Continuing with deployment [10:08:30] (03CR) 10Brouberol: admin_ng/dse-k8s: values automatic reformatting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300052 (owner: 10Brouberol) [10:10:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:12:22] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300064|ProductionServices: reboot poolcounter1006.eqiad (T426736)]] (duration: 07m 46s) [10:13:14] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1006.eqiad.wmnet [10:13:37] (03PS1) 10Blake: ProductionServices: reboot poolcounter1007.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300072 (https://phabricator.wikimedia.org/T426736) [10:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at 
eqiad: 22.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:14:18] (03CR) 10Effie Mouzeli: [C:03+1] ProductionServices: reboot poolcounter1007.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300072 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:14:30] !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1152: Security updates [10:14:30] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [10:14:38] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [10:14:38] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1152: Security updates [10:15:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:16:05] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s: values automatic reformatting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300052 (owner: 10Brouberol) [10:16:31] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s/ceph: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300053 (owner: 10Brouberol) [10:16:34] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s: values automatic reformatting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300052 (owner: 10Brouberol) [10:16:38] (03PS1) 10Clément Goubert: tls_terminator: Convert size to kB for rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1300077 (https://phabricator.wikimedia.org/T414440) 
[10:16:59] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1006.eqiad.wmnet [10:17:01] (03PS2) 10Blake: ProductionServices: reboot poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300072 (https://phabricator.wikimedia.org/T426736) [10:17:46] (03CR) 10Btullis: admin_ng/dse-k8s-eqiad/cloudnative-pg sort out tenant namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300054 (owner: 10Brouberol) [10:17:47] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Cache liftwing-openapi-specs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300065 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [10:17:55] (03PS3) 10Blake: ProductionServices: reboot poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300072 (https://phabricator.wikimedia.org/T426736) [10:18:11] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s/opensearch: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300055 (owner: 10Brouberol) [10:18:38] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s-eqiad/flink: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300056 (owner: 10Brouberol) [10:19:03] (03CR) 10Blake: [C:03+2] ProductionServices: reboot poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300072 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:19:20] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300057 (owner: 10Brouberol) [10:20:04] PROBLEM - MariaDB Replica IO: ms1 on db2251 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1152.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1152.eqiad.wmnet (111 Connection refused) 
https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [10:20:08] (03Merged) 10jenkins-bot: rest-gateway: Cache liftwing-openapi-specs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300065 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [10:20:28] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:20:48] (03Merged) 10jenkins-bot: ProductionServices: reboot poolcounter1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300072 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:21:01] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:21:07] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:21:08] (03CR) 10Btullis: admin_ng: Add comments describing each group of environments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 (owner: 10Brouberol) [10:21:30] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:21:34] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:21:35] (03PS3) 10Brouberol: admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 [10:21:37] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1300072|ProductionServices: reboot poolcounter1007 (T426736)]] [10:21:40] (03CR) 10Brouberol: admin_ng: Add comments describing each group of environments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 (owner: 10Brouberol) [10:21:52] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:21:53] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [10:22:29] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[10:22:48] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s/ceph: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300053 (owner: 10Brouberol) [10:23:44] !log blake@deploy1003 blake: Backport for [[gerrit:1300072|ProductionServices: reboot poolcounter1007 (T426736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:24:56] !log blake@deploy1003 blake: Continuing with deployment [10:25:04] RECOVERY - MariaDB Replica IO: ms1 on db2251 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [10:27:09] !log jmm@cumin2002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for sretest2009.codfw.wmnet: Renew puppet certificate - jmm@cumin2002 [10:27:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [10:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:29:18] (03CR) 10Brouberol: admin_ng/dse-k8s-eqiad/cloudnative-pg sort out tenant namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300054 (owner: 10Brouberol) [10:29:22] !log 
blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300072|ProductionServices: reboot poolcounter1007 (T426736)]] (duration: 07m 45s) [10:29:38] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1007.eqiad.wmnet [10:31:17] PROBLEM - MariaDB Replica IO: ms1 on db1152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2251.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2251.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [10:31:55] (03PS2) 10Blake: ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300082 (https://phabricator.wikimedia.org/T426736) [10:32:33] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s-eqiad/cloudnative-pg sort out tenant namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300054 (owner: 10Brouberol) [10:32:39] (03PS1) 10Marco Fossati: TestKitchen: enable instrument config fetching on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300085 (https://phabricator.wikimedia.org/T426231) [10:32:59] (03CR) 10Clément Goubert: [C:03+1] ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300082 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:33:17] RECOVERY - MariaDB Replica IO: ms1 on db1152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [10:33:25] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1007.eqiad.wmnet [10:33:49] (03CR) 10Cathal Mooney: [C:03+2] Nokia SR-Linux: get specific component status with gnmic [puppet] - 10https://gerrit.wikimedia.org/r/1299634 
(https://phabricator.wikimedia.org/T428685) (owner: 10Cathal Mooney) [10:33:53] (03CR) 10Blake: [C:03+2] ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300082 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:33:54] (03PS2) 10Marco Fossati: TestKitchen: enable instrument config fetching on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300085 (https://phabricator.wikimedia.org/T426231) [10:34:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:34:50] (03Merged) 10jenkins-bot: ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300082 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:34:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:35:06] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s/opensearch: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300055 (owner: 10Brouberol) [10:35:45] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1300082|ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 (T426736)]] [10:37:27] !log failover Ganeti master in eqsin to ganeti5007 T428229 [10:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:31] T428229: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229 [10:37:49] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:38:06] !log blake@deploy1003 blake: Backport for [[gerrit:1300082|ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 (T426736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:38:37] !log root@cumin1003 START - Cookbook sre.mysql.pool pool db1152: Security updates [10:38:37] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [10:38:50] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [10:38:50] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1152: Security updates [10:38:51] !log blake@deploy1003 blake: Continuing with deployment [10:40:24] (03PS1) 10Gkyziridis: ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300089 (https://phabricator.wikimedia.org/T427902) [10:41:02] !log installing nginx security updates [10:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:22] (03Merged) 10jenkins-bot: admin_ng/dse-k8s/opensearch: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300055 (owner: 10Brouberol) [10:43:24] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300082|ProductionServices: reboot poolcounter2005, re-add poolcounter 1007 (T426736)]] (duration: 07m 38s) [10:43:40] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2005.codfw.wmnet [10:45:18] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s-eqiad/flink: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300056 (owner: 10Brouberol) [10:45:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:46:11] (03PS1) 10Blake: ProductionServices: reboot poolcounter2006, re-add poolcounter 2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300087 (https://phabricator.wikimedia.org/T426736) [10:46:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:46:44] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[10:47:21] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:47:37] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2005.codfw.wmnet [10:47:43] (03CR) 10Blake: [C:03+2] ProductionServices: reboot poolcounter2006, re-add poolcounter 2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300087 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:48:39] (03Merged) 10jenkins-bot: ProductionServices: reboot poolcounter2006, re-add poolcounter 2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300087 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:48:51] 06SRE, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#12004064 (10hnowlan) a:03tappof can this be resolved? [10:49:19] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1300087|ProductionServices: reboot poolcounter2006, re-add poolcounter 2005 (T426736)]] [10:49:28] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#12004067 (10Marostegui) I am not really sure I understand what the issue is here. As far as I know we've not touched that alert (MySQLReplicaNotUsingGTID) in ages. It should return a hos... [10:49:50] 06SRE, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#12004069 (10tappof) 05Open→03Resolved Yes. Resolved. [10:50:36] (03PS1) 10Clément Goubert: ratelimit-media: bump nutcracker mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300090 [10:51:24] !log blake@deploy1003 blake: Backport for [[gerrit:1300087|ProductionServices: reboot poolcounter2006, re-add poolcounter 2005 (T426736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [10:51:31] !log remove ganeti5004 from eqsin cluster for reimage T428229 [10:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:35] T428229: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229 [10:51:44] !log blake@deploy1003 blake: Continuing with deployment [10:53:11] (03PS1) 10Blake: ProductionServices: re-add poolcounter2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300092 (https://phabricator.wikimedia.org/T426736) [10:53:39] (03CR) 10Clément Goubert: [C:03+2] ratelimit-media: bump nutcracker mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300090 (owner: 10Clément Goubert) [10:54:01] (03Merged) 10jenkins-bot: admin_ng/dse-k8s-eqiad/flink: sort out tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300056 (owner: 10Brouberol) [10:54:37] (03CR) 10Clément Goubert: [C:03+1] ProductionServices: re-add poolcounter2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300092 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [10:54:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:54:49] PROBLEM - ganeti-confd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:54:49] PROBLEM - ganeti-noded running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:54:50] FIRING: ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:56:01] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300087|ProductionServices: reboot poolcounter2006, re-add poolcounter 2005 (T426736)]] (duration: 06m 42s) [10:56:07] (03Merged) 10jenkins-bot: ratelimit-media: bump nutcracker mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300090 (owner: 10Clément Goubert) [10:56:11] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2006.codfw.wmnet [10:56:34] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [10:56:42] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [10:56:47] i have one backport left to do to bring the last poolcounter server back, so might overrun the infra window by up to 10m [10:56:48] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [10:57:07] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [10:57:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:57:11] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [10:57:27] (03CR) 10Majavah: [C:03+1] wmf-config: Update private subnets to include additions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [10:57:29] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [10:59:58] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
poolcounter2006.codfw.wmnet [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1100). [11:00:11] (03CR) 10Blake: [C:03+2] ProductionServices: re-add poolcounter2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300092 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [11:01:07] (03Merged) 10jenkins-bot: ProductionServices: re-add poolcounter2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300092 (https://phabricator.wikimedia.org/T426736) (owner: 10Blake) [11:01:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:01:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor= - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:01:50] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1300092|ProductionServices: re-add poolcounter2006 (T426736)]] [11:02:00] (03PS2) 10Brouberol: admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300057 [11:02:00] (03PS4) 10Brouberol: admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 [11:02:11] (03CR) 10Brouberol: admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1300057 (owner: 10Brouberol) [11:02:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:02:33] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:03:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:04:07] !log blake@deploy1003 blake: Backport for [[gerrit:1300092|ProductionServices: re-add poolcounter2006 (T426736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:04:30] !log blake@deploy1003 blake: Continuing with deployment [11:04:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:06:39] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:07:23] (03CR) 10Blake: [C:03+1] docker-registry: switch to rdb1015 #3 [puppet] - 10https://gerrit.wikimedia.org/r/1299467 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [11:08:45] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300092|ProductionServices: re-add poolcounter2006 (T426736)]] (duration: 06m 55s) [11:08:56] i'm done with backports, apologies for the delay :) [11:09:02] !log root@cumin1003 START - Cookbook sre.mysql.depool depool db1151: Security updates [11:09:02] !log 
root@cumin1003 START - Cookbook sre.mysql.parsercache [11:09:10] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:09:10] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1151: Security updates [11:09:21] (03CR) 10Effie Mouzeli: [C:03+2] docker-registry: switch to rdb1015 #3 [puppet] - 10https://gerrit.wikimedia.org/r/1299467 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [11:10:00] jouncebot: now [11:10:00] For the next 0 hour(s) and 49 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1100) [11:10:04] grand [11:12:07] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#12004164 (10tappof) Ok, the AlertLintProblem alert here is making you aware that you have configured an alert on some series that were generated by an exporter and are no longer present.... [11:12:47] (03CR) 10Btullis: [C:03+1] admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 (owner: 10Brouberol) [11:14:51] (03CR) 10Brouberol: [C:03+2] admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300057 (owner: 10Brouberol) [11:14:59] (03CR) 10Brouberol: [C:03+2] admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 (owner: 10Brouberol) [11:15:11] PROBLEM - MariaDB Replica IO: ms2 on db2253 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1151.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1151.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [11:15:30] !log cgoubert@deploy1003 helmfile 
[codfw] START helmfile.d/services/ratelimit: apply [11:15:45] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [11:15:50] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [11:16:02] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [11:16:39] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:17:05] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298831 (owner: 10PipelineBot) [11:18:11] RECOVERY - MariaDB Replica IO: ms2 on db2253 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [11:18:12] (03CR) 10Arnaudb: [C:03+1] "Adding @cgoubert@wikimedia.org and @cdanis@wikimedia.org as reviewers for this as I'm not familiar with the next steps once it's submitted" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1178068 (https://phabricator.wikimedia.org/T401733) (owner: 10Hashar) [11:21:36] (03CR) 10AikoChou: [C:03+1] ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300089 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:21:39] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:22:23] (03CR) 10Gkyziridis: [C:03+2] ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300089 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:23:11] (03Merged) 10jenkins-bot: 
admin_ng/dse-k8s/cfssl-issuer: add comments to k8s_dse_opensearch profile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300057 (owner: 10Brouberol) [11:23:19] (03Merged) 10jenkins-bot: admin_ng: Add comments describing each group of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300061 (owner: 10Brouberol) [11:23:37] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [11:23:41] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [11:23:45] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [11:23:52] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [11:23:54] 06SRE, 10Observability-Metrics: Infrastructure-related Grafana dashboards should not be split by data center - https://phabricator.wikimedia.org/T406472#12004194 (10hnowlan) This is a fairly wide-reaching change which would require every team to modify their dashboards as the DC selector is per-dashboard - I'm... 
[11:25:30] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298831 (owner: 10PipelineBot) [11:25:39] PROBLEM - MariaDB Replica IO: ms2 on db1151 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2253.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2253.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [11:27:10] (03PS3) 10Klausman: home/klausman: Add kubectl script and tweak tmuxp recipe to use it [puppet] - 10https://gerrit.wikimedia.org/r/1289897 [11:27:21] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:27:21] (03CR) 10Klausman: [V:03+2 C:03+2] home/klausman: Add kubectl script and tweak tmuxp recipe to use it [puppet] - 10https://gerrit.wikimedia.org/r/1289897 (owner: 10Klausman) [11:27:39] RECOVERY - MariaDB Replica IO: ms2 on db1151 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [11:27:43] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:30:08] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:30:36] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:30:50] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:31:05] (03PS1) 10Arnaudb: ssh-client-config: add gitlab-ssh.wikimedia.org [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1300101 (https://phabricator.wikimedia.org/T425441) [11:31:06] (03Merged) 10jenkins-bot: ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300089 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:31:10] FIRING: [3x] BFDdown: BFD session down between 
cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:31:20] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:32:48] (03PS1) 10Reedy: Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300102 (https://phabricator.wikimedia.org/T420792) [11:33:08] !log root@cumin1003 START - Cookbook sre.mysql.pool pool db1151: Security updates [11:33:08] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [11:33:21] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:33:21] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1151: Security updates [11:33:40] jouncebot: nowandnext [11:33:40] For the next 0 hour(s) and 26 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1100) [11:33:40] In 1 hour(s) and 26 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1300) [11:34:00] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'liftwing-openapi-server' for release 'main' . [11:34:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5004.eqsin.wmnet with OS bookworm [11:34:37] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'liftwing-openapi-server' for release 'main' . [11:34:46] !log jmm@cumin2002 START - Cookbook sre.hosts.move-vlan for host ganeti5004 [11:35:13] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'liftwing-openapi-server' for release 'main' . 
[11:37:26] (03PS18) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [11:37:49] jmm@cumin2002 reimage (PID 1713265) is awaiting input [11:38:05] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:41:42] (03CR) 10Elukey: "@jhathaway@wikimedia.org thanks for the tests, I've refactored all the redfish_test.py code to be inside RedfishTest, it didn't make much " [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [11:44:04] jmm@cumin2002 reimage (PID 1713265) is awaiting input [11:45:30] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:45:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1186: Upgrading db1186.eqiad.wmnet [11:46:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1186: Upgrading db1186.eqiad.wmnet [11:47:45] (03CR) 10Reedy: [C:03+2] wmf-config: Add $wmgOATHAuthRequire2FAForAll config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299643 (https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [11:48:27] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS trixie [11:48:52] (03Merged) 10jenkins-bot: wmf-config: Add $wmgOATHAuthRequire2FAForAll config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299643 (https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [11:48:52] (03PS1) 10Reedy: Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups [extensions/OATHAuth] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300104 (https://phabricator.wikimedia.org/T420792) [11:49:05] (03CR) 10Reedy: [C:03+2] Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300102 
(https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [11:49:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:49:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti5004 - jmm@cumin2002" [11:49:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti5004 - jmm@cumin2002" [11:49:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:25] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti5004.eqsin.wmnet 40.0.132.10.in-addr.arpa 0.4.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [11:49:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti5004.eqsin.wmnet 40.0.132.10.in-addr.arpa 0.4.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [11:49:30] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004 [11:49:31] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2170: Upgrading db2170.codfw.wmnet [11:49:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2170: Upgrading db2170.codfw.wmnet [11:51:00] (03Merged) 10jenkins-bot: Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300102 (https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [11:51:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti5004 [11:51:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ganeti5004 [11:53:20] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host 
db2170.codfw.wmnet with OS trixie [11:54:24] (03CR) 10Reedy: [C:03+2] Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups [extensions/OATHAuth] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300104 (https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [11:55:50] (03Merged) 10jenkins-bot: Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups [extensions/OATHAuth] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300104 (https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [11:57:02] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1300104|Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups (T420792)]], [[gerrit:1300102|Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups (T420792)]], [[gerrit:1299643|wmf-config: Add $wmgOATHAuthRequire2FAForAll config (T420792)]] [11:57:06] T420792: Allow 2FA to be enforced for all accounts on a private wiki - https://phabricator.wikimedia.org/T420792 [11:57:33] (03CR) 10Mszwarc: "Just noting that this will also require 2FA from all users on wikis such as arbcom wikis, steward wiki, CU wiki etc., that are not listed " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [11:59:09] !log reedy@deploy1003 reedy: Backport for [[gerrit:1300104|Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups (T420792)]], [[gerrit:1300102|Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups (T420792)]], [[gerrit:1299643|wmf-config: Add $wmgOATHAuthRequire2FAForAll config (T420792)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:02:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage [12:03:18] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable ULS v2 on group0 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [12:03:43] (03CR) 10Reedy: "I... didn't make the list, so if there are wikis missing ("other phases TBD)... :popcorn:." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [12:03:50] !log reedy@deploy1003 reedy: Continuing with deployment [12:04:59] 06SRE, 06Data-Persistence, 06DBA, 13Patch-For-Review: Build wmfdb-admin for Trixie - https://phabricator.wikimedia.org/T427900#12004423 (10MoritzMuehlenhoff) @FCeratto-WMF Can you please import them to the "main" component of apt.wikimedia.org? [12:06:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage [12:08:08] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300104|Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups (T420792)]], [[gerrit:1300102|Mandatory2FAChecker: Allow getGroupsRequiring2FA() to work on implicit groups (T420792)]], [[gerrit:1299643|wmf-config: Add $wmgOATHAuthRequire2FAForAll config (T420792)]] (duration: 11m 06s) [12:08:13] T420792: Allow 2FA to be enforced for all accounts on a private wiki - https://phabricator.wikimedia.org/T420792 [12:11:56] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2170.codfw.wmnet with reason: host reimage [12:12:57] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS trixie [12:13:06] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12004452 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jiji@cumin1003 for host rdb1014.eqiad.wmnet with OS trixie [12:13:23] !log jiji@cumin1003 START - Cookbook sre.hosts.move-vlan for host rdb1014 [12:16:15] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [12:17:12] (03CR) 10Mszwarc: "I don't have a strong opinion on that, and whatever comms says, would be okay for me. I wanted to primarily raise that the set of wikis af" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [12:19:02] (03CR) 10Reedy: "I updated the task description to make it much more obvious/explicit what "everything else" looks like - https://phabricator.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [12:19:21] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: host reimage [12:20:23] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [12:21:15] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host rdb1014 - jiji@cumin1003" [12:21:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host rdb1014 - jiji@cumin1003" [12:21:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:21:20] !log jiji@cumin1003 START - Cookbook sre.dns.wipe-cache rdb1014.eqiad.wmnet 42.48.64.10.in-addr.arpa 2.4.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:21:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) rdb1014.eqiad.wmnet 42.48.64.10.in-addr.arpa 
2.4.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:21:25] !log jiji@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host rdb1014 [12:23:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1186.eqiad.wmnet with OS trixie [12:24:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb1014 [12:24:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host rdb1014 [12:24:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [12:26:58] (03CR) 10Cathal Mooney: [C:03+2] QoS: Move DSCP AF41 from 'low' to 'normal' priority class [homer/public] - 10https://gerrit.wikimedia.org/r/1285350 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney) [12:28:36] (03Merged) 10jenkins-bot: QoS: Move DSCP AF41 from 'low' to 'normal' priority class [homer/public] - 10https://gerrit.wikimedia.org/r/1285350 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney) [12:30:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [12:30:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for  - https://phabricator.wikimedia.org/T427553#12004518 (10APDube-WMF) Hi @RLazarus - thanks for granting access! While I am able to log into Superset now - I still cannot view the dashboard. It still shows "a... 
[12:31:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T427553#12004521 (10APDube-WMF) 05Resolved→03Open [12:31:51] (03PS1) 10Muehlenhoff: Update account meta data for okryva [puppet] - 10https://gerrit.wikimedia.org/r/1300113 [12:32:31] FIRING: Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:32:38] (03CR) 10CI reject: [V:04-1] Update account meta data for okryva [puppet] - 10https://gerrit.wikimedia.org/r/1300113 (owner: 10Muehlenhoff) [12:32:40] (03PS1) 10Effie Mouzeli: mediawiki_common: update IP for rdb1014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300114 (https://phabricator.wikimedia.org/T421711) [12:33:19] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1186: Migration of db1186.eqiad.wmnet completed [12:35:38] (03PS1) 10JMeybohm: aptrepo: Add components for calico, istio and kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1300115 (https://phabricator.wikimedia.org/T427069) [12:36:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2170.codfw.wmnet with OS trixie [12:37:31] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:38:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299516 (owner: 10Sbisson) [12:39:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:39:41] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1300115 (https://phabricator.wikimedia.org/T427069) (owner: 10JMeybohm) [12:41:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:41:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:41:44] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1014.eqiad.wmnet with reason: host reimage [12:42:33] !log re-map DSCP AF41 from 'low' to 'normal' priority qos class on network T424640 [12:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:37] T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640 [12:44:55] it's been ages since I've done this, so figured I'd doublecheck: is it still fine to just merge InitialiseSettings-labs.php-only changes, or should these also be "deployed"? 
[12:45:01] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1014.eqiad.wmnet with reason: host reimage [12:46:27] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [12:46:31] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [12:46:35] (03CR) 10Jgiannelos: [C:03+1] "I copied locally and verified the RC release @dziewonski@fastmail.fm shared and the signature were verified using the key published in thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 (https://phabricator.wikimedia.org/T423267) (owner: 10Bartosz Dziewoński) [12:46:41] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [12:46:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5004.eqsin.wmnet with OS bookworm [12:46:48] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [12:47:44] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2170: Migration of db2170.codfw.wmnet completed [12:48:32] matthiasmullie: preferably you would at least pull them to the prod deployment host [12:48:33] (03PS3) 10Abijeet Patro: Enable ULS v2 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 [12:48:38] I think they’ll show up as “unexpected undeployed” otherwise [12:48:45] (03CR) 10Abijeet Patro: Enable ULS v2 on group0 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [12:49:00] (and you should also be able to `scap backport` / SpiderPig them and it’ll figure out on its own that no deploy is needed, and exit after the git pull) [12:50:49] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable ULS v2 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [12:51:10] 
Lucas_WMDE: perfect, thanks! [12:52:31] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:52:33] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:17] (03PS1) 10Brouberol: airflow: add ARROW_LIBHDFS_DIR/LD_LIBRARY_PATH to the of hadoop env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300131 (https://phabricator.wikimedia.org/T428099) [12:57:24] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300131 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [12:57:31] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:00:00] o/ [13:00:02] (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1300132 [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1300) [13:00:05] abijeet, topranks, and stephanebisson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:23] * topranks here [13:00:24] o/ I’m in a meeting for 30 more minutes, so if someone else can deploy that would be great [13:00:54] abijeet do you need someone to deploy your patch? 
[13:01:10] stephanebisson, yea that would be great [13:01:17] abijeet I can do it [13:01:25] thanks! [13:01:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [13:02:29] (03Merged) 10jenkins-bot: Enable ULS v2 on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299676 (owner: 10Abijeet Patro) [13:02:56] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1299676|Enable ULS v2 on group0 wikis]] [13:03:01] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1014.eqiad.wmnet with OS trixie [13:03:08] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12004579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1003 for host rdb1014.eqiad.wmnet with OS trixie completed: - rdb101... [13:05:04] !log sbisson@deploy1003 sbisson, abi: Backport for [[gerrit:1299676|Enable ULS v2 on group0 wikis]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:05:35] abijeet ready to test [13:05:48] stephanebisson, thanks checking. [13:06:04] (03CR) 10Ayounsi: [C:03+1] mediawiki_common: update IP for rdb1014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300114 (https://phabricator.wikimedia.org/T421711) (owner: 10Effie Mouzeli) [13:08:33] that's weird. I'm checking on https://he.wikipedia.org/wiki/%D7%90%D7%95%D7%A8%D7%99%D7%92%D7%9E%D7%99 which is in group0. I run `mw.config.get( 'wgULSLanguageSelectorV2Enabled' )` in the console and it returns false. [13:10:22] abijeet: https://versions.toolforge.org/ claims hewiki is in group1 [13:10:36] uh, sorry. 
Thanks [13:10:36] abijeet testwiki is in group0 [13:10:39] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-reboot-tcp-proxy rolling restart_daemons on A:tcpproxy and A:tcpproxy [13:10:51] (03PS1) 10Atsuko: aptrepo: Add thirdparty/opensearch1 to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1300135 (https://phabricator.wikimedia.org/T418809) [13:14:16] stephanebisson, looks good. [13:15:00] abijeet I see wgULSLanguageSelectorV2Enabled already true even if I don't fully deploy... [13:15:33] I'll proceed with the deployment anyway [13:15:38] !log sbisson@deploy1003 sbisson, abi: Continuing with deployment [13:15:43] * Lucas_WMDE meeting done [13:16:01] stephanebisson, yes, that will happen if you have the beta feature enabled. If you test without being logged in, it should be false on group1 wikis but with the config enabled it'll be true. [13:16:13] stephanebisson: I get false/true depending on WikimediaDebug state fwiw [13:17:13] (03PS1) 10JMeybohm: Add 1.29.4, drop 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1300138 (https://phabricator.wikimedia.org/T427401) [13:18:07] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [13:18:11] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [13:18:17] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [13:18:24] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [13:18:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1186: Migration of db1186.eqiad.wmnet completed [13:18:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:19:34] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:19:56] !log sbisson@deploy1003 Finished scap sync-world: 
Backport for [[gerrit:1299676|Enable ULS v2 on group0 wikis]] (duration: 17m 00s) [13:20:17] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-reboot-hcaptcha-proxy rolling restart_daemons on A:hcaptcha-proxy and A:hcaptcha-proxy [13:20:40] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-durum rolling restart_daemons on A:durum and A:durum [13:21:14] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300135 (https://phabricator.wikimedia.org/T418809) (owner: 10Atsuko) [13:21:53] (03PS1) 10Eevans: data-gateway: deploy v1.0.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300140 (https://phabricator.wikimedia.org/T424386) [13:22:17] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox) [13:22:52] topranks can you deploy your patch? [13:23:23] sort of?? it's my first deploy rodeo so I'm not 100% sure what's needed... [13:23:30] I can certainly merge the patch in gerrit [13:24:04] PROBLEM - Host cp5018 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:19] topranks: ideally you would log in at https://spiderpig.wikimedia.org/ and deploy the change there [13:24:36] Lucas_WMDE: looking [13:24:40] topranks don't merge the patch. I suggest you try to get training and make sure you have the right access [13:24:42] uh? [13:24:44] (03CR) 10Eevans: [C:03+2] data-gateway: deploy v1.0.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300140 (https://phabricator.wikimedia.org/T424386) (owner: 10Eevans) [13:24:45] cp5018 [13:24:48] sukhe@cumin1003 roll-restart-reboot-tcp-proxy (PID 2846626) is awaiting input [13:25:06] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5018.eqsin.wmnet,service=(cdn|ats-be) [13:26:45] stephanebisson81: is anyone else able to assist? 
there are some network changes blocked on this being merged, I apologise though I sort of got left holding the can here [13:26:50] sukhe: that is my fault [13:26:52] cp5018 [13:26:56] (03Merged) 10jenkins-bot: data-gateway: deploy v1.0.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300140 (https://phabricator.wikimedia.org/T424386) (owner: 10Eevans) [13:27:01] topranks: oh ok no worries! I depooled [13:27:07] I thought it was already down, I couldn't ping [13:27:18] Lucas_WMDE do you want to deploy or should I? [13:27:26] I don’t mind either way :) [13:27:30] (03PS1) 10Muehlenhoff: cumin2003: Add host Hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1300141 (https://phabricator.wikimedia.org/T427897) [13:27:31] Go for it [13:27:32] netbox has updated IPs for it, I think there was an aborted reimage / vlan move for it that's left its state inconsistent [13:27:37] ok, can do [13:27:56] stephanebisson81, Lukas_WMDE: thank you <3 [13:27:58] (03PS2) 10Muehlenhoff: cumin2003: Add host Hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1300141 (https://phabricator.wikimedia.org/T427897) [13:28:02] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-tcp-proxy (exit_code=0) rolling restart_daemons on A:tcpproxy and A:tcpproxy [13:28:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [13:28:36] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5018.eqsin.wmnet with reason: host down [13:29:10] (03Merged) 10jenkins-bot: wmf-config: Update private subnets to include additions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [13:29:24] (03CR) 10Bking: [C:03+1] aptrepo: Add thirdparty/opensearch1 to bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1300135 (https://phabricator.wikimedia.org/T418809) (owner: 10Atsuko) [13:29:37] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1297237|wmf-config: Update private subnets to include additions (T427393)]] [13:29:42] T427393: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393 [13:31:21] (03PS1) 10Jelto: Build helm3.19 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) [13:31:37] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply [13:31:41] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, brett: Backport for [[gerrit:1297237|wmf-config: Update private subnets to include additions (T427393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:31:50] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [13:31:56] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement, and 2 others: decommission deploy2002.codfw.wmnet - https://phabricator.wikimedia.org/T426222#12004717 (10Raine) [13:32:00] topranks: anything to test for this change on WikimediaDebug? 
[13:32:08] (I suspect the answer might be “no” ^^) [13:32:23] Lucas_WMDE: no not really [13:32:31] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [13:32:34] we've not removed anything, just added new ranges that will be in use shortly [13:32:38] (03PS2) 10Jelto: Build helm3.19 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) [13:32:38] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, brett: Continuing with deployment [13:32:42] alright, thanks [13:32:47] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:32:51] I think overall fairly safe, the config is auto generated too [13:32:52] thanks [13:33:07] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1195: Upgrading db1195.eqiad.wmnet [13:33:13] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2170: Migration of db2170.codfw.wmnet completed [13:33:14] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:33:34] (03CR) 10Brouberol: [C:03+2] airflow: add ARROW_LIBHDFS_DIR/LD_LIBRARY_PATH to the of hadoop env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300131 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [13:33:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling restart_daemons on A:durum and A:durum [13:33:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-hcaptcha-proxy (exit_code=0) rolling restart_daemons on A:hcaptcha-proxy and A:hcaptcha-proxy [13:33:37] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1195: Upgrading db1195.eqiad.wmnet [13:34:40] (03CR) 10Lucas Werkmeister (WMDE): ArticleGuidance: restrict beta deployment to enwiki (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1299516 (owner: 10Sbisson) [13:34:43] (03CR) 10Atsuko: [C:03+2] aptrepo: Add thirdparty/opensearch1 to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1300135 (https://phabricator.wikimedia.org/T418809) (owner: 10Atsuko) [13:36:37] cwilliams@cumin1003 major-upgrade (PID 2864054) is awaiting input [13:36:58] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297237|wmf-config: Update private subnets to include additions (T427393)]] (duration: 07m 20s) [13:37:02] T427393: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393 [13:37:28] stephanebisson81: over to you :) [13:38:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299516 (owner: 10Sbisson) [13:38:18] (03PS17) 10Daniel Kinzler: rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) [13:38:32] RECOVERY - Host cp5018 is UP: PING OK - Packet loss = 0%, RTA = 251.07 ms [13:39:12] (03Merged) 10jenkins-bot: ArticleGuidance: restrict beta deployment to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299516 (owner: 10Sbisson) [13:39:16] thanks for the help with the deploy folks!! [13:39:25] np :) [13:39:39] sukhe: I undid the change homer pushed so cp5018 is reachable on the IP the host is currently configured with [13:39:53] that IP doesn't match dns/netbox so idk, but anyway icinga is happy [13:40:29] topranks: thanks, I will check after the meeting. it is depooled for now yeah [13:40:39] > < topranks> that IP doesn't match dns/netbox so idk, [13:40:46] by this you mean that you are surprised why it broke with the change? [13:40:52] or that something is wrong and we need to look? 
[13:41:01] (03CR) 10Jcrespo: [C:03+2] dbbackups: Reenable regular es backups and update RO job ids [puppet] - 10https://gerrit.wikimedia.org/r/1300041 (https://phabricator.wikimedia.org/T427357) (owner: 10Jcrespo) [13:41:27] sukhe: nah I'm not surprised, I tried to ping its old IP before I pushed the change to see if pushing it through would be an issue, but I ended up pinging the wrong IP [13:41:54] I'm aware of what's going on with it, brett was going to reimage but held off pending the mw prefix-list update that was just deployed to be cautious [13:42:01] (03PS1) 10Brouberol: airflow: upgrade the image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300147 (https://phabricator.wikimedia.org/T428099) [13:42:05] stephanebisson81: are you still there? [13:42:15] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS trixie [13:42:15] Lucas_WMDE yes? [13:42:22] sukhe: but obviously the process was started and now it's sort-of in limbo, nothing tricky we can tidy it up later [13:42:33] ah, sorry, I didn’t see the deployment [13:42:44] then we’re all done, I think? [13:42:47] Lucas_WMDE all done [13:42:57] !log UTC afternoon backport+config window done [13:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:09] topranks: thanks, noted [13:44:16] I will check with him when he is up [13:44:20] (03PS1) 10Muehlenhoff: Remove access for atieno [puppet] - 10https://gerrit.wikimedia.org/r/1300148 [13:45:27] sukhe: thanks, I'll be around later if anything is needed, but I'll be out for an hour or two in his AM. If you can pass on that the mw change was deployed so we have a green light now to proceed that'd be great [13:45:35] topranks: thanks, noted! 
[13:47:29] (03PS2) 10JMeybohm: Add 1.29.4, drop 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1300138 (https://phabricator.wikimedia.org/T427401) [13:47:42] (03PS2) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) [13:48:47] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1300138 (https://phabricator.wikimedia.org/T427401) (owner: 10JMeybohm) [13:50:40] (03CR) 10JMeybohm: [C:03+2] aptrepo: Add components for calico, istio and kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1300115 (https://phabricator.wikimedia.org/T427069) (owner: 10JMeybohm) [13:50:58] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-06-06-013944 to 2026-06-09-174730 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300149 [13:50:58] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-03-020126 to 2026-06-09-215338 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300150 (https://phabricator.wikimedia.org/T419774) [13:51:55] (03CR) 10Muehlenhoff: [C:03+2] Remove access for atieno [puppet] - 10https://gerrit.wikimedia.org/r/1300148 (owner: 10Muehlenhoff) [13:54:18] (03PS1) 10Cathal Mooney: wmf-plugin: temp excpetion for new vlan names in eqsin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1300152 (https://phabricator.wikimedia.org/T428229) [13:54:47] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Atieno out of all services on: 2436 hosts [13:56:28] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage [13:58:06] (03CR) 10Btullis: [C:03+1] airflow: upgrade the image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300147 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [13:58:30] (03CR) 10Brouberol: [C:03+2] airflow: upgrade the image 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1300147 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [13:58:38] !log atsuko@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/ttmserver-export.php --wiki=default --ttmserver eqiad-test # T425377 populating production index on test cluster to estimate time required for the release [13:58:42] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [13:58:44] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:59:01] (03CR) 10MSantos: [C:03+2] Add my public key to mediawiki.org/keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 (https://phabricator.wikimedia.org/T423267) (owner: 10Bartosz Dziewoński) [13:59:05] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2173: Upgrading db2173.codfw.wmnet [13:59:06] (03CR) 10JMeybohm: [C:03+1] ratelimit-media: policy and user-class level metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [13:59:27] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2173: Upgrading db2173.codfw.wmnet [13:59:49] (03PS1) 10Krinkle: Disable ShortUrl extension on bdwikimedia, bhwiki, bnwiki, eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300154 (https://phabricator.wikimedia.org/T107188) [13:59:58] (03Merged) 10jenkins-bot: Add my public key to mediawiki.org/keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 (https://phabricator.wikimedia.org/T423267) (owner: 10Bartosz Dziewoński) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1400) [14:00:10] (03PS2) 10Krinkle: Disable ShortUrl extension on bdwikimedia, bhwiki, bnwiki, eswikibooks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1300154 (https://phabricator.wikimedia.org/T107188) [14:00:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:00:44] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage [14:00:53] (03CR) 10MSantos: Add my public key to mediawiki.org/keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 (https://phabricator.wikimedia.org/T423267) (owner: 10Bartosz Dziewoński) [14:01:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:02:25] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2173.codfw.wmnet with OS trixie [14:02:33] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12004846 (10VRiley-WMF) Hey @Marostegui, as it turns out, I am not able to find a compatible processor for this unit. Should we commence with the removal of this unit? 
[14:03:06] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-06-06-013944 to 2026-06-09-174730 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300149 (owner: 10Jforrester) [14:03:46] (03CR) 10Ayounsi: [C:03+1] wmf-plugin: temp excpetion for new vlan names in eqsin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1300152 (https://phabricator.wikimedia.org/T428229) (owner: 10Cathal Mooney) [14:05:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:05:48] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-06-06-013944 to 2026-06-09-174730 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300149 (owner: 10Jforrester) [14:06:04] (03CR) 10JMeybohm: [C:03+1] "Nice hack" [puppet] - 10https://gerrit.wikimedia.org/r/1300077 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [14:06:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:07:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:07:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:07:51] (03PS1) 10AOkoth: mariadb: add grants for phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) [14:07:58] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [14:08:23] (03CR) 10CI reject: [V:04-1] mariadb: add grants for phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [14:08:50] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:02] 
!log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [14:09:10] (03PS2) 10AOkoth: mariadb: add grants for phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) [14:09:47] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:10:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:10:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571#12004896 (10Jclark-ctr) These Disk will arrive today. I would like to swap them today or tomorrow @BTullis @RKemper [14:11:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:11:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [14:12:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [14:13:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:13:01] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:07] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:13:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:14:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:14:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:15:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:15:32] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 
2026-06-03-020126 to 2026-06-09-215338 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300150 (https://phabricator.wikimedia.org/T419774) (owner: 10Jforrester) [14:15:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:15:37] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:16:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [14:17:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [14:17:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:17:28] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 2 (backup1013, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:17:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:17:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [14:17:55] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-06-03-020126 to 2026-06-09-215338 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300150 (https://phabricator.wikimedia.org/T419774) (owner: 10Jforrester) [14:17:55] ^ that is expected, all is good but it is an artifact of some hacking [14:18:01] it will go away tonight [14:18:04] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS trixie [14:18:23] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:18:26] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 42%, RTA = 5053.00 ms [14:18:28] (03CR) 10Clément Goubert: [C:03+2] ratelimit-media: policy and user-class level metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 
(https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [14:18:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [14:18:42] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:18:51] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:19:25] (03PS6) 10Daniel Kinzler: rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) [14:19:35] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:19:41] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:20:03] FIRING: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:12] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:20:28] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:20:33] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2173.codfw.wmnet with reason: host reimage [14:20:49] (03Merged) 10jenkins-bot: ratelimit-media: policy and user-class level metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [14:20:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and (A:dnsbox) [14:21:00] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [14:22:45] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [14:22:51] !log 
cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [14:22:56] (03PS13) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [14:23:12] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [14:24:20] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T415109#12004944 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Closing this for now. [14:24:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: host reimage [14:24:54] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist translate extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release, now with dblist translate [14:24:59] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [14:25:03] RESOLVED: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:47] (03CR) 10Cathal Mooney: [C:03+2] wmf-plugin: temp excpetion for new vlan names in eqsin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1300152 (https://phabricator.wikimedia.org/T428229) (owner: 10Cathal Mooney) [14:26:20] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [14:26:38] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [14:27:08] !log vriley@cumin1003 START - Cookbook 
sre.hosts.provision for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:28:16] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin[2002-2003].codfw.wmnet,cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:29:03] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12004994 (10VRiley-WMF) 05Open→03In progress Starting work on these data platform sre - dse-k8s-worker1009, an-conf1004, an-conf1005, an-conf1006, cloudelastic1011, cloudelastic1012 [14:29:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1195: Migration of db1195.eqiad.wmnet completed [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1430) [14:30:40] PROBLEM - Host dse-k8s-worker1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:27] cmooney@cumin1003 python-code (PID 2906062) is awaiting input [14:33:09] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:34:21] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) homer to cumin[2002-2003].codfw.wmnet,cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:34:23] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin[2002-2003].codfw.wmnet,cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:34:32] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:34:55] hi yall, i might need some help. it looks like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1299614 got +2'd and merged, but not deployed. can anyone help me remedy that? [14:36:42] MatmaRex: 👀 [14:36:44] jouncebot: nowandnext [14:36:44] For the next 0 hour(s) and 23 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1400) [14:36:44] For the next 0 hour(s) and 23 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1430) [14:36:44] In 2 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1700) [14:36:55] Clear at my end. [14:36:56] I was thinking of deploying something anyway (assuming Wikifunctions / Test Kitchen are okay with it) [14:36:59] thx James_F [14:37:36] MatmaRex: yeah the latest change on deploy1003 is “ArticleGuidance: restrict beta deployment to enwiki” :/ [14:38:01] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox and (A:dnsbox) [14:38:08] * Lucas_WMDE spiderpigs it [14:38:15] cmooney@cumin1003 python-code (PID 2911890) is awaiting input [14:38:22] a bit of miscommunication with Mateus. 
i don't think he's here, but i let him know on Slack that i'll try to get it deployed [14:38:25] thanks Lucas_WMDE [14:38:27] hopefully scap knows what to do with docroot changes [14:38:29] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1299614|Add my public key to mediawiki.org/keys (T423267)]] [14:38:34] T423267: Release MW 1.46.0-rc.0 - https://phabricator.wikimedia.org/T423267 [14:39:00] (03PS2) 10Muehlenhoff: Update account meta data for okryva [puppet] - 10https://gerrit.wikimedia.org/r/1300113 [14:39:05] (03PS1) 10Lucas Werkmeister (WMDE): Fix snak value display for rtl languages [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300168 (https://phabricator.wikimedia.org/T360854) [14:39:20] (03PS1) 10Lucas Werkmeister (WMDE): Fix snak value display for rtl languages [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300169 (https://phabricator.wikimedia.org/T360854) [14:39:32] ^ ^ those are the two changes I’d like to cherry-pick if it’s okay with everyone [14:39:37] (once the current spiderpig is done, that is) [14:39:45] I’ll get started on the gate-and-submit already [14:39:50] but feel free to interrupt [14:39:59] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300169 (https://phabricator.wikimedia.org/T360854) (owner: 10Lucas Werkmeister (WMDE)) [14:39:59] (03PS3) 10Krinkle: Disable ShortUrl extension on bdwikimedia, bhwiki, bnwiki, bnwikisource, eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300154 (https://phabricator.wikimedia.org/T107188) [14:40:07] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300168 (https://phabricator.wikimedia.org/T360854) (owner: 10Lucas Werkmeister (WMDE)) [14:40:08] FIRING: [3x] 
NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:40:14] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) homer to cumin[2002-2003].codfw.wmnet,cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:40:32] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin[2002-2003].codfw.wmnet,cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:40:36] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1299614|Add my public key to mediawiki.org/keys (T423267)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:40:50] MatmaRex: I’m not sure if WikimediaDebug works on this change but try it? ^^ [14:41:08] (03CR) 10Ssingh: [C:03+1] wikimedia.org: Introduce thumb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298821 (https://phabricator.wikimedia.org/T427465) (owner: 10Ladsgroup) [14:41:11] ooh, it seems to work on my end at least [14:41:17] yeah, seems to work [14:41:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2173.codfw.wmnet with OS trixie [14:41:38] hmm, if you view https://www.mediawiki.org/keys/keys.txt in the browser, does my last name look mangled to you as well? [14:41:52] yup [14:41:59] I expect Firefox is just guessing an encoding [14:42:10] and probably just guessing it based on the first X bytes which don’t go that far down [14:42:13] yeah [14:42:25] i wonder if we could serve it iwth the rght content-type or something. 
but we can figure that out later [14:42:35] content-type: text/plain – no utf8 in sight [14:42:41] yeah, let’s sync this first [14:42:44] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with deployment [14:43:23] I’m slightly surprised you’re the first “victim” in that file, but all the other names don’t look mangled [14:43:33] (but also, oh god, bvibber wrong name spotted /o\) [14:43:58] Lucas_WMDE: that might sadly be intentional, i think it's supposed to match the key metadata :/ [14:44:04] :S [14:44:23] cmooney@cumin1003 python-code (PID 2915977) is awaiting input [14:44:27] but maybe it doesn't have to. i don't know how important that is [14:45:00] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:45:01] afaik, gpg ignores that part, it's just for humans [14:45:03] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:14] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) homer to cumin[2002-2003].codfw.wmnet,cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:47:03] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299614|Add my public key to mediawiki.org/keys (T423267)]] (duration: 08m 33s) [14:47:07] T423267: Release MW 1.46.0-rc.0 - https://phabricator.wikimedia.org/T423267 [14:47:43] i can assure you any gpg key i have in that file is long since 
expired and the private key lost ;) [14:48:01] I guess it’s just for historical reference then [14:48:13] for the handful of weirdos who want to verify the old tarballs :) [14:48:21] content-type: text/plain [14:48:24] hehe [14:48:24] idk if any tooling would complain if the displayed name was updated [14:48:29] should presumably be charset=utf-8 [14:48:30] Yes. [14:48:33] !log ayounsi@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - ayounsi@cumin1003 [14:49:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300169 (https://phabricator.wikimedia.org/T360854) (owner: 10Lucas Werkmeister (WMDE)) [14:49:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300168 (https://phabricator.wikimedia.org/T360854) (owner: 10Lucas Werkmeister (WMDE)) [14:49:12] thanks for deploying Lucas_WMDE, you're the best [14:49:16] np :) [14:49:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - ayounsi@cumin1003 [14:50:03] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:25] !log ayounsi@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2003.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - ayounsi@cumin1003 [14:50:38] !log ayounsi@cumin1003 END (FAIL) - 
Cookbook sre.deploy.python-code (exit_code=99) homer to cumin2003.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - ayounsi@cumin1003 [14:51:37] hmm. i wonder if we need to purge some cache? i'm still getting the old version if i curl or wget https://www.mediawiki.org/keys/keys.txt [14:51:57] looking [14:52:08] 06SRE, 06ServiceOps new, 10ServiceOps-Services-Oids, 10Thumbor: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445#12005124 (10MLechvien-WMF) a:03JTweed-WMF Reassigning to @JTweed-WMF for visibility and triaging [14:52:15] I see your name in there when I curl it [14:52:25] 10SRE-swift-storage, 06Data-Persistence, 10Prod-Kubernetes, 06ServiceOps new, and 5 others: Fix thumbor discovery records and make swift use them - https://phabricator.wikimedia.org/T397618#12005126 (10MLechvien-WMF) a:03JTweed-WMF [14:52:26] (and also x-cache(-status) miss [14:52:29] ) [14:52:42] MatmaRex: do you get a cache hit in the response headers? [14:52:59] last-modified: Thu, 26 Mar 2026 18:44:49 GMT [14:52:59] x-cache: cp3069 miss, cp3069 hit/7 [14:52:59] x-cache-status: hit-front [14:53:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2173: Migration of db2173.codfw.wmnet completed [14:53:03] ok [14:53:09] I guess I’ll try a purge then [14:53:39] should purge https://www.mediawiki.org/keys/keys.txt, https://www.mediawiki.org/keys/keys.html, and https://www.mediawiki.org/keys/ [14:54:14] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist translate extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release, now with correct schema [14:54:19] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [14:54:30] (03CR) 10Alex Paskulin: [C:03+1] "Looks good to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300073 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [14:55:23] !log lucaswerkmeister-wmde@deploy1003 $ printf 'https://www.mediawiki.org/keys/%s\n' '' 'keys.txt' 'keys.html' | mwscript-k8s --attach --comment=T423267 purgeList mediawikiwiki [14:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:27] T423267: Release MW 1.46.0-rc.0 - https://phabricator.wikimedia.org/T423267 [14:55:35] MatmaRex: better now? (not sure if that script needed to run on mediawikiwiki or enwiki) [14:56:20] (03PS1) 10Santiago Faci: Deploy GrowthBook 4.4.0 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300173 (https://phabricator.wikimedia.org/T427506) [14:56:31] Lucas_WMDE: yup! [14:56:34] thanks [14:56:34] \o/ [14:56:43] for future reference, what did you do? [14:56:52] the command I logged above ^^ [14:57:05] oh right, i missed it [14:57:07] thanks! [14:57:07] pipe the three URLs into purgeList [14:57:09] np [14:58:33] (03Merged) 10jenkins-bot: Fix snak value display for rtl languages [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300169 (https://phabricator.wikimedia.org/T360854) (owner: 10Lucas Werkmeister (WMDE)) [14:58:35] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:58:41] I’ll definitely run into the test kitchen window, sorry 😬 [14:58:51] * Lucas_WMDE wills castor-save-workspace-cache to finish faster [14:58:52] (03Merged) 10jenkins-bot: Fix snak value display for rtl languages [extensions/Wikibase] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300168 (https://phabricator.wikimedia.org/T360854) (owner: 10Lucas Werkmeister (WMDE)) [14:58:55] oh hey it worked [14:59:22] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1300169|Fix 
snak value display for rtl languages (T360854)]], [[gerrit:1300168|Fix snak value display for rtl languages (T360854)]] [14:59:23] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [14:59:26] T360854: [MEX] M5 - Statement values are left-aligned for rtl languages on the mobile site - https://phabricator.wikimedia.org/T360854 [15:00:14] (03CR) 10Clément Goubert: [C:03+1] wgRestSandboxSpecs: Add Lift Wing API to documentation wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300073 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [15:01:39] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1300169|Fix snak value display for rtl languages (T360854)]], [[gerrit:1300168|Fix snak value display for rtl languages (T360854)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
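For reference, the stale-cache check and purge discussed above (14:51 to 14:57) can be sketched roughly as follows. This is a minimal illustration, not the actual tooling: the helper names `purge_urls`, `purge_stdin` and `served_from_cache` are hypothetical, and treating any `x-cache-status` value that starts with `hit` as "served from cache" is an assumption based only on the `hit-front` and `miss` values quoted in the channel. It makes no network requests; it only builds the URL list that was piped into `purgeList` and interprets already-fetched response headers.

```python
from typing import Mapping, Sequence

# Base URL taken from the log; the three suffixes mirror the logged printf command.
KEYS_BASE = "https://www.mediawiki.org/keys/"


def purge_urls(base: str = KEYS_BASE,
               names: Sequence[str] = ("", "keys.txt", "keys.html")) -> list[str]:
    """Return the URLs to purge: the directory itself plus the two files."""
    return [base + name for name in names]


def purge_stdin(urls: Sequence[str]) -> str:
    """One URL per line, newline-terminated, like the printf output piped to purgeList."""
    return "".join(url + "\n" for url in urls)


def served_from_cache(headers: Mapping[str, str]) -> bool:
    """Assumption: an x-cache-status beginning with 'hit' means a cached copy was served."""
    status = headers.get("x-cache-status", "").strip().lower()
    return status.startswith("hit")


if __name__ == "__main__":
    # Headers as pasted in the channel before the purge.
    stale = {"x-cache": "cp3069 miss, cp3069 hit/7", "x-cache-status": "hit-front"}
    if served_from_cache(stale):
        print(purge_stdin(purge_urls()), end="")
```

If the response is a cache hit with an old `last-modified`, the printed list is what would be fed to the purge script on stdin; a `miss` means the edge already fetched a fresh copy and no purge is needed.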
[15:01:45] checking… [15:02:55] (03PS4) 10Krinkle: Disable ShortUrl on bdwikimedia, bhwiki, bnwiki, bnwikisource, eswikibooks, gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300154 (https://phabricator.wikimedia.org/T107188) [15:03:42] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with deployment [15:03:45] looks better than before [15:04:16] oh, the test kitchen window *ended* at 15:00 UTC, I thought it started then [15:04:24] I could’ve just waited for my deploy then [15:04:25] sorry [15:04:47] (03CR) 10Elukey: [C:03+1] cumin2003: Add host Hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1300141 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [15:08:01] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300169|Fix snak value display for rtl languages (T360854)]], [[gerrit:1300168|Fix snak value display for rtl languages (T360854)]] (duration: 08m 39s) [15:08:06] T360854: [MEX] M5 - Statement values are left-aligned for rtl languages on the mobile site - https://phabricator.wikimedia.org/T360854 [15:08:13] * Lucas_WMDE done deploying [15:11:20] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin1003.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [15:11:36] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) homer to cumin1003.codfw.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [15:11:41] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 [15:12:30] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1003.eqiad.wmnet with reason: add new eqsin vlans as legacy temp workaround in wmf-plugin.py - cmooney@cumin1003 
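On the keys.txt encoding question raised earlier (`content-type: text/plain` with no charset, so the browser guesses), a small sketch of why non-ASCII names look mangled. This assumes the file is UTF-8 on disk and that the browser falls back to a single-byte encoding such as Windows-1252; both are guesses from the conversation, not confirmed. `declared_charset` and `mojibake` are hypothetical helper names.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def mojibake(text: str, guessed: str = "cp1252") -> str:
    """Show how UTF-8 bytes render when decoded with a wrongly guessed encoding."""
    return text.encode("utf-8").decode(guessed, errors="replace")


if __name__ == "__main__":
    print(declared_charset("text/plain"))                 # no charset declared
    print(declared_charset("text/plain; charset=utf-8"))  # what the fix would send
    # ASCII survives a wrong guess, non-ASCII letters do not.
    print(mojibake("plain ascii"), "|", mojibake("\u0144"))
```

With `charset=utf-8` in the header the browser no longer has to guess, which is the change proposed in the discussion.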
[15:14:07] (03CR) 10BCornwall: [C:03+2] common: Update cp5018's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1299579 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [15:15:05] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1195: Migration of db1195.eqiad.wmnet completed [15:15:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:15:32] !log brett@cumin2002 START - Cookbook sre.dns.netbox [15:18:10] FIRING: [4x] GanetiBGPDown: BGP session down between ganeti5004 and cr2-eqsin - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [15:18:31] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:32] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp5018.eqsin.wmnet 18.0.132.10.in-addr.arpa 8.1.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [15:18:36] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp5018.eqsin.wmnet 18.0.132.10.in-addr.arpa 8.1.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [15:18:36] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5018 [15:18:53] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [15:19:22] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5018 [15:19:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cp5018 [15:20:39] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist translate extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release [15:20:45] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on 
k8s - https://phabricator.wikimedia.org/T425377 [15:21:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:35] (03PS1) 10Muehlenhoff: Remove access for harroyo-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1300178 [15:23:06] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1009 [15:24:05] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Harroyo-wmf out of all services on: 2436 hosts [15:24:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1009 [15:24:23] (03PS1) 10Bking: relforge: remove cluster bootstrap config [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) [15:24:31] (03PS1) 10Cathal Mooney: ganeti5004: set up custom bgp neighbors for private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300180 (https://phabricator.wikimedia.org/T428229) [15:24:33] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:24:58] (03PS2) 10Cathal Mooney: ganeti5004: set up custom bgp neighbors for private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300180 (https://phabricator.wikimedia.org/T428229) [15:25:15] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300180 (https://phabricator.wikimedia.org/T428229) (owner: 10Cathal Mooney) [15:25:33] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:25:49] (03CR) 10Muehlenhoff: [C:03+2] Remove access for harroyo-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1300178 (owner: 10Muehlenhoff) [15:27:07] (03PS1) 10Reedy: CirrusSearch-production: Mark 'CirrusSearch Streaming Updater' as a reserved username [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1300182 (https://phabricator.wikimedia.org/T428687) [15:27:58] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:28:17] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1206: Upgrading db1206.eqiad.wmnet [15:28:58] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1206: Upgrading db1206.eqiad.wmnet [15:30:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1206.eqiad.wmnet with OS trixie [15:30:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [15:32:29] (03Abandoned) 10Elukey: WIP - docker_registry: introduce migration backends in Nginx [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) (owner: 10Elukey) [15:32:33] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:33:29] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release (dblist: https://phabricator.wikimedia.org/P94013) [15:33:33] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [15:34:38] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:59] !log drain traffic through cr2-drmrs to reset pic0 [15:35:01] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:21] (03PS1) 10Arnaudb: gerrit: point mtail at relocated httpd access logs [puppet] - 10https://gerrit.wikimedia.org/r/1300187 (https://phabricator.wikimedia.org/T425667) [15:37:33] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:37:53] MatmaRex, Krinkle: filed T428772 for the encoding issue [15:37:54] T428772: Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772 [15:38:11] (I have no idea which tags it should have :/) [15:38:15] thanks [15:38:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2173: Migration of db2173.codfw.wmnet completed [15:38:34] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:39:13] (03CR) 10Jelto: [C:03+1] "lgtm, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1300187 (https://phabricator.wikimedia.org/T425667) (owner: 10Arnaudb) [15:40:57] (03CR) 10Arnaudb: [C:03+2] gerrit: point mtail at relocated httpd access logs [puppet] - 10https://gerrit.wikimedia.org/r/1300187 (https://phabricator.wikimedia.org/T425667) (owner: 10Arnaudb) [15:41:07] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12005475 (10VRiley-WMF) Currently running into issues with dse-k8s-worker1009, looking into this [15:41:18] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12005477 (10VRiley-WMF) 05In progress→03Open [15:41:37] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:39] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:39] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:41:50] PROBLEM - Host cr2-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [15:41:56] uh? [15:41:58] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:01] PROBLEM - Host cr2-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:07] topranks: XioNoX: ^ expected? [15:42:10] FIRING: [11x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:42:14] here [15:42:16] god a page :) [15:42:19] *got [15:42:25] !incidents [15:42:25] 8066 (UNACKED) Host cr2-drmrs [15:42:29] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool drmrs [reason: no reason specified, no task ID specified] [15:42:30] !ack 8066 [15:42:30] 8066 (ACKED) Host cr2-drmrs [15:42:33] I am ready to depool drmrs [15:42:39] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:39] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:44] RECOVERY - Host cr2-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.54 ms [15:42:44] need a +1 to hit it [15:42:57] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:58] sukhe: patch? [15:43:08] Amir1: depool cookbook [15:43:18] https://wikitech.wikimedia.org/wiki/DNS#Change_GeoDNS_/_Depool_a_Site [15:43:22] ah okay [15:43:24] I forgot [15:43:27] did you get a resolve? [15:43:31] yeah [15:43:34] ok [15:43:36] aborting for now then [15:43:39] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool drmrs [reason: no reason specified, no task ID specified] [15:43:59] yeah [15:44:10] both yeah, netops should look into this [15:44:16] took me a bit to figure out it was a resolve [15:45:40] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage [15:45:52] Some logs for cr1 in https://librenms.wikimedia.org/device/239/logs/eventlog [15:46:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:46:43] did we miss some maintenance ? [15:46:49] like external [15:46:57] https://librenms.wikimedia.org/graphs/type=device_state/device=239/from=1749570300/ [15:47:03] (03CR) 10Ayounsi: [C:03+1] ganeti5004: set up custom bgp neighbors for private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300180 (https://phabricator.wikimedia.org/T428229) (owner: 10Cathal Mooney) [15:47:03] RECOVERY - Host cr2-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.66 ms [15:47:04] wait no, that'sold [15:47:10] RESOLVED: [11x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:47:32] sukhe: sold? 
[15:47:37] sorry, that's old [15:47:51] the only thing I see in noc@ is a maintenance that was completed 7 hours ago [15:47:55] and that too for DE-CIX [15:48:25] elukey: yeah nothing I can see, so defering to topranks and XioNoX [15:48:57] sorry this is all me [15:49:28] (03PS2) 10Ryan Kemper: hadoop.reboot-workers: drop custom --dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/1287290 (https://phabricator.wikimedia.org/T411568) [15:49:38] Amir1, sukhe: apologies [15:49:51] I will complete the work then take stock, I forgot to downtime but some doesn't make sense to me [15:49:58] https://grafana.wikimedia.org/goto/bfoq8319683y8d?orgId=1 for a useful dashboard when it comes to site to site connectivity [15:50:02] overall I think things are ok [15:50:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage [15:50:40] topranks: no worries, thanks for sharing! we were wondering if it's something external [15:50:43] (03PS1) 10Hnowlan: thumbor: introduce improve ECS logger, return body on error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300191 (https://phabricator.wikimedia.org/T368180) [15:50:44] (03Abandoned) 10Ryan Kemper: airflow-test-k8s: add ldap-sync task-pod egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286750 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [15:50:59] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:51:10] sukhe: no I have to drain cr2-drmrs of traffic to change the port config [15:51:44] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [15:51:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b13-drmrs:et-0/0/48 (Core: cr2-drmrs:et-0/0/1 {#D0102}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:51:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:52:25] FIRING: [17x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:52:39] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:52:40] FIRING: [17x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:52:48] (03CR) 10JMeybohm: "Did you verify this build runs on bookworm as well?" 
[debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) (owner: 10Jelto) [15:53:43] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:54:22] FIRING: CertAlmostExpired: gNMI TLS certificate for cr2-drmrs.wikimedia.org is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:54:38] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:54:38] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:54:54] (03CR) 10Ladsgroup: [C:03+1] "ship it 🚢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300191 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [15:55:01] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:56:41] jouncebot: nowandnext [15:56:41] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [15:56:41] In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1700) [15:56:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:56:54] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:57:25] RESOLVED: [17x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:59:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [15:59:22] RESOLVED: CertAlmostExpired: gNMI TLS certificate for cr2-drmrs.wikimedia.org is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:59:38] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:59:48] folks not 100% sure what happened, the disruptive command I issued was at 15:48, prior to that I drained the router but that was only BGP preference, the OSPF adjacencies/errors do not make sense [16:01:43] !log apt: uploaded libvmod-wmfuniq 0.3.0 for trixie [16:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:24] (03PS1) 10Elukey: cumin: remove SubjectAltNameWarning suppression [puppet] - 10https://gerrit.wikimedia.org/r/1300194 (https://phabricator.wikimedia.org/T427897) [16:03:23] would I be okay to do a little thumbor maintenance? 
starting with staging first [16:04:23] (03CR) 10Andrew Bogott: [C:03+2] add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:04:34] (03CR) 10Andrew Bogott: [C:03+2] cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:04:57] hnowlan: Thank you! I'm around [16:07:04] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12005551 (10VRiley-WMF) I have been working with @Jclark-ctr on this. It was pointed out that it looks like only updated site.pp file has been updated but not yaml @BTullis would you be able to... [16:07:18] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1206.eqiad.wmnet with OS trixie [16:07:33] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:07:37] (03CR) 10Hnowlan: [C:03+2] thumbor: introduce improve ECS logger, return body on error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300191 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [16:07:38] Amir1: ,3 [16:07:40] *<3 [16:09:55] (03Merged) 10jenkins-bot: thumbor: introduce improve ECS logger, return body on error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300191 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [16:11:34] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:12:24] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:12:25] (03CR) 10Cathal Mooney: [C:03+2] ganeti5004: set up custom bgp neighbors for private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300180 (https://phabricator.wikimedia.org/T428229) (owner: 10Cathal Mooney) [16:12:33] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:12:40] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [16:12:43] (03CR) 10Volans: [C:03+1] "If the puppetdb certs have the FQDN in the SAN I think it can be safely removed even if Debian (03CR) 10Elukey: [C:03+2] cumin: remove SubjectAltNameWarning suppression [puppet] - 10https://gerrit.wikimedia.org/r/1300194 (https://phabricator.wikimedia.org/T427897) (owner: 10Elukey) [16:14:25] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:14:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:15:23] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [16:15:30] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:16:03] (03CR) 10Ryan Kemper: [C:03+2] hadoop.reboot-workers: drop custom --dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/1287290 (https://phabricator.wikimedia.org/T411568) (owner: 10Ryan Kemper) [16:16:12] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12005583 (10elukey) @jcrespo can you retry the test when you have a moment? 
Should work now :) [16:16:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:44] (03CR) 10AOkoth: [C:03+1] gitlab: advertise gitlab-ssh.wikimedia.org in UI clone URLs [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [16:16:54] (03Abandoned) 10Ryan Kemper: global_config: add ldap-sync external services [puppet] - 10https://gerrit.wikimedia.org/r/1286748 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [16:17:58] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1206: Migration of db1206.eqiad.wmnet completed [16:18:55] (03Merged) 10jenkins-bot: hadoop.reboot-workers: drop custom --dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/1287290 (https://phabricator.wikimedia.org/T411568) (owner: 10Ryan Kemper) [16:20:28] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:21:35] (03Abandoned) 10Marco Fossati: TestKitchen: enable instrument config fetching on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300085 (https://phabricator.wikimedia.org/T426231) (owner: 10Marco Fossati) [16:21:51] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12005615 (10jcrespo) I will try on Friday or ASAP. 
[16:22:33] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:22:35] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:25:36] (03PS1) 10Andrew Bogott: openstack::apply_security_groups: only run on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) [16:25:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:26:08] (03CR) 10CI reject: [V:04-1] openstack::apply_security_groups: only run on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:26:42] 06SRE, 06Infrastructure-Foundations, 10netops: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#12005652 (10cmooney) 05Open→03Resolved All work on this is now complete. [16:28:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5018.eqsin.wmnet with OS trixie [16:34:00] !log bblack@cumin1003 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7008.magru.wmnet} and A:cp - Upgrade wmfuniq to 0.3.0 () [16:34:38] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:05] (03CR) 10Aqu: [V:03+1 C:03+1] "We keep a separate job for commonswiki rather than folding the table into the existing grouped run. 
globalimagelinks only exists on common" [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [16:35:24] (03PS2) 10Andrew Bogott: openstack::apply_security_groups: only run on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) [16:35:40] RESOLVED: GanetiBGPDown: BGP session down between ganeti5004 and cr3-eqsin - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Ganeti4&var-bgp_neighbor=ganeti5004 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [16:37:33] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:37:38] (03PS1) 10RLazarus: admin: Actually add apdube to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1300202 (https://phabricator.wikimedia.org/T427553) [16:39:08] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:39:55] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:39:56] !log bblack@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7008.magru.wmnet} and A:cp - Upgrade wmfuniq to 0.3.0 () [16:40:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300073 (https://phabricator.wikimedia.org/T427902) (owner: 
10Gkyziridis) [16:41:14] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:44:57] (03PS3) 10Andrew Bogott: openstack::apply_security_groups: only run on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) [16:45:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:47:41] (03PS2) 10Ryan Kemper: relforge: comment out cluster bootstrap config [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [16:47:41] (03CR) 10Hnowlan: [C:03+1] admin: Actually add apdube to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1300202 (https://phabricator.wikimedia.org/T427553) (owner: 10RLazarus) [16:48:48] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [16:49:07] (03CR) 10RLazarus: [C:03+2] admin: Actually add apdube to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1300202 (https://phabricator.wikimedia.org/T427553) (owner: 10RLazarus) [16:49:16] (03CR) 10Ryan Kemper: [C:03+1] "heading out to walk dog so +1 for now, contingent upon pcc not showing anything crazy (it won't, this is a simple patch)" [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [16:50:23] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS trixie [16:50:43] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host kafka-main2010 [16:51:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for - https://phabricator.wikimedia.org/T427553#12005832 (10RLazarus) My fault, sorry about that! 
Please wait up to 30 minutes again, for the fix to propagate everywhere, then give it another... [16:53:46] jasmine@cumin2002 reimage (PID 1790524) is awaiting input [16:54:26] (03PS1) 10Jasmine: hieradata/common.yaml: add new IPs for kafka-main2010, following vlan migration [puppet] - 10https://gerrit.wikimedia.org/r/1300208 (https://phabricator.wikimedia.org/T427088) [16:56:01] (03CR) 10Jasmine: [C:03+2] hieradata/common.yaml: add new IPs for kafka-main2010, following vlan migration [puppet] - 10https://gerrit.wikimedia.org/r/1300208 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [16:56:11] (03CR) 10Jasmine: [C:03+2] kafka-main2010: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [16:57:25] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [16:58:15] (03PS3) 10Ryan Kemper: relforge: comment out cluster bootstrap config [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [16:58:20] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [16:58:36] (03CR) 10Ryan Kemper: [C:03+1] "must be stale facts or something. 
since i'm about to head out, i just switched to manual hosts declaration in the meantime" [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [16:59:34] (03CR) 10Aqu: [V:03+1 C:03+1] "Testing (all wikis):" [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1700) [17:00:13] (03CR) 10Aqu: [V:03+1 C:03+1] Add filerevision to the mediawiki not-history sqoop (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [17:01:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for - https://phabricator.wikimedia.org/T427553#12005916 (10APDube-WMF) @RLazarus Works perfectly now. Thanks so much for the quick fix. [17:01:52] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main2010 - jasmine@cumin2002" [17:02:14] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main2010 - jasmine@cumin2002" [17:02:14] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:02:14] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache kafka-main2010.codfw.wmnet 35.48.192.10.in-addr.arpa 5.3.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:02:18] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-main2010.codfw.wmnet 35.48.192.10.in-addr.arpa 5.3.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:02:19] !log jasmine@cumin2002 START - 
Cookbook sre.network.configure-switch-interfaces for host kafka-main2010 [17:02:39] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main2010 [17:02:39] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-main2010 [17:03:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1206: Migration of db1206.eqiad.wmnet completed [17:03:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:04:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for  - https://phabricator.wikimedia.org/T427553#12005930 (10RLazarus) 05Open→03Resolved Great! [17:05:24] (03CR) 10Andrew Bogott: [C:03+2] openstack::apply_security_groups: only run on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1300197 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:07:58] (03PS1) 10Hnowlan: thumbor: emit structured logs from haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300211 (https://phabricator.wikimedia.org/T368180) [17:12:49] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox and (A:dnsbox) [17:13:03] so i'm making the release candidate, i uploaded the files, but this still shows as empty: https://releases.wikimedia.org/mediawiki/1.46/ - i think it's just cached [17:13:40] i think this would fix it? `echo 'https://releases.wikimedia.org/mediawiki/1.46/' | mwscript-k8s --attach --comment=T423267 purgeList mediawikiwiki` [17:13:41] T423267: Release MW 1.46.0-rc.0 - https://phabricator.wikimedia.org/T423267 [17:13:58] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:13:59] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:14:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [17:14:07] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:14:08] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:14:16] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:14:17] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:14:26] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:14:27] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:14:34] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [17:14:36] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [17:14:43] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [17:14:44] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [17:14:52] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [17:14:53] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [17:15:01] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:15:03] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:15:10] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [17:15:11] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:15:17] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[17:15:19] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:15:23] (03CR) 10Hnowlan: [C:03+1] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1300132 (owner: 10Muehlenhoff) [17:15:25] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [17:15:26] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:15:43] MatmaRex: I see files from today [17:16:13] also confirmed they are on both backend servers [17:16:13] mutante: hmm, i don't :) [17:16:18] yeah, the files are there [17:16:36] and i see them if i view e.g. https://releases.wikimedia.org/mediawiki/1.46/?C=N;O=A (sorted by name) [17:16:38] how about https://releases.wikimedia.org/mediawiki/1.46/?foo=bla [17:16:40] just not on https://releases.wikimedia.org/mediawiki/1.46/ [17:16:55] presumably since i viewed that after creating the directory, but before uploading the files [17:17:22] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:17:31] i suppose it'll fix itself… not sure how long it'll take [17:17:44] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:17:47] MatmaRex: try again now [17:17:52] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:18:01] mutante: yep, fixed. thanks [17:18:06] I did run the purgeList.php [17:18:08] but like this: [17:18:16] echo 'https://releases.wikimedia.org/mediawiki/1.46/' | mwscript-k8s --attach -- purgeList.php [17:18:34] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:18:41] thanks. i wasn't sure how it's used [17:18:42] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:19:10] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:19:18] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[17:20:07] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:20:15] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [17:20:42] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [17:20:47] MatmaRex: I like the timing because I was planning to upgrade and switch the backend servers of that [17:20:50] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [17:21:40] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [17:21:48] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [17:22:08] !log bblack@cumin1003 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text and not P{cp7008*} and A:cp - Upgrade wmfuniq to 0.3.0 () [17:22:33] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [17:22:39] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [17:22:40] !log bblack@cumin1003 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload and not P{cp7008.magru.wmnet} and A:cp - Upgrade wmfuniq to 0.3.0 () [17:22:47] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:22:54] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2174: Upgrading db2174.codfw.wmnet [17:23:14] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:23:18] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [17:23:21] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[17:23:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2174: Upgrading db2174.codfw.wmnet [17:23:48] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [17:24:08] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1218: Upgrading db1218.eqiad.wmnet [17:24:10] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:24:16] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [17:24:17] (03CR) 10Bking: [C:03+2] relforge: comment out cluster bootstrap config [puppet] - 10https://gerrit.wikimedia.org/r/1300179 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [17:24:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1218: Upgrading db1218.eqiad.wmnet [17:24:48] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:24:54] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [17:25:44] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:26:00] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2174.codfw.wmnet with OS trixie [17:26:20] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1218.eqiad.wmnet with OS trixie [17:28:52] (03CR) 10Dzahn: "thanks for the review. I would love if I could merge it now, but that's going to break things." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:29:28] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [17:30:19] (03PS1) 10Andrew Bogott: Add openstack::apply_security_groups in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1300215 (https://phabricator.wikimedia.org/T422801) [17:32:21] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12006084 (10BTullis) >>! In T401441#12005550, @VRiley-WMF wrote: > I have been working with @Jclark-ctr on this. It was pointed out that it looks like only updated site.pp file has been updated... [17:33:22] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release (dblist: https://phabricator.wikimedia.org/P94021) [17:33:27] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [17:34:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300215 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:39:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [17:42:53] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage [17:44:24] !log cwilliams@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db2174.codfw.wmnet with reason: host reimage [17:44:58] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [17:45:20] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [17:45:34] (03PS2) 10Andrew Bogott: Add openstack::apply_security_groups in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1300215 (https://phabricator.wikimedia.org/T422801) [17:45:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300215 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:46:18] (03PS2) 10Kimberly Sarabia: Remove custom streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298875 (https://phabricator.wikimedia.org/T423148) [17:46:42] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2010.codfw.wmnet with OS trixie [17:49:11] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage [17:52:44] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: host reimage [17:58:12] bblack@cumin1003 roll-upgrade-varnish (PID 3007919) is awaiting input [18:00:05] dduvall and jnuche: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T1800). 
[18:03:47] bblack@cumin1003 roll-upgrade-varnish (PID 3007919) is awaiting input [18:06:24] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1218.eqiad.wmnet with OS trixie [18:10:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2174.codfw.wmnet with OS trixie [18:12:01] o/ about to roll the train [18:12:54] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300225 (https://phabricator.wikimedia.org/T423915) [18:12:57] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300225 (https://phabricator.wikimedia.org/T423915) (owner: 10TrainBranchBot) [18:13:51] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300225 (https://phabricator.wikimedia.org/T423915) (owner: 10TrainBranchBot) [18:15:02] (03CR) 10Andrew Bogott: [C:03+2] Add openstack::apply_security_groups in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1300215 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [18:16:45] (03CR) 10ArielGlenn: [C:03+1] "Subject to the typo fix and the timeout line (if it needs anything), this looks good for a first rollout. (1 of 3)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [18:16:52] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5018.* [18:17:05] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1218: Migration of db1218.eqiad.wmnet completed [18:17:34] (03CR) 10ArielGlenn: [C:03+1] "Subject to changes to the timeout lines (if they need anything), this looks good for a first rollout. 
(2/3)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) (owner: 10Daniel Kinzler) [18:18:15] (03CR) 10ArielGlenn: "Subject to the one question I have about the commit message, this looks good for a first rollout. (3/3)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [18:20:26] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.6 refs T423915 [18:20:31] T423915: 1.47.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T423915 [18:20:39] (03PS1) 10GergesShamon: [arzwiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) [18:21:26] (03CR) 10CI reject: [V:04-1] [arzwiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) (owner: 10GergesShamon) [18:21:32] (03PS2) 10GergesShamon: [arzwiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) [18:22:18] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2174: Migration of db2174.codfw.wmnet completed [18:22:20] (03CR) 10CI reject: [V:04-1] [arzwiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) (owner: 10GergesShamon) [18:23:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS trixie [18:24:21] !log brett@cumin2002 START - Cookbook sre.hosts.move-vlan for host cp5020 [18:25:17] (03PS1) 10BCornwall: common: Update cp5020's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1300227 (https://phabricator.wikimedia.org/T428229) [18:27:24] brett@cumin2002 reimage (PID 1811605) is awaiting input [18:30:18] (03PS3) 10GergesShamon: [arzwiki] Change the wordmark [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) [18:31:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) (owner: 10GergesShamon) [18:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:37:06] PROBLEM - Check unit status of security_group_ssh-from-restricted-bastion_to_project_trove on cloudcontrol1006 is CRITICAL: CRITICAL: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_trove https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:43:54] (03CR) 10Ssingh: [C:03+1] common: Update cp5020's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1300227 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [18:45:00] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [18:46:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:47:37] (03PS1) 10Bking: cirrussearch: create docker-based role for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1300232 (https://phabricator.wikimedia.org/T425585) [18:48:10] (03CR) 10CI reject: [V:04-1] cirrussearch: create docker-based role for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1300232 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [18:50:08] (03PS1) 10Andrew Bogott: apply_security_groups.pp: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/1300233 [18:51:02] (03CR) 10Andrew Bogott: [C:03+2] apply_security_groups.pp: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/1300233 (owner: 10Andrew Bogott) [18:51:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:55:59] (03CR) 10BCornwall: [C:03+2] common: Update cp5020's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1300227 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [19:02:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1218: Migration of db1218.eqiad.wmnet completed [19:02:37] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.mysql.major-upgrade (exit_code=0) [19:07:06] RECOVERY - Check unit status of security_group_ssh-from-restricted-bastion_to_project_trove on cloudcontrol1006 is OK: OK: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_trove https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:07:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2174: Migration of db2174.codfw.wmnet completed [19:07:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [19:10:46] (03PS1) 10BCornwall: roll-upgrade-varnish: glob varnishkafka-all svc [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 [19:11:06] !log brett@cumin2002 START - Cookbook sre.dns.netbox [19:11:51] (03CR) 10Ssingh: roll-upgrade-varnish: glob varnishkafka-all svc (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:13:39] (03CR) 10BCornwall: roll-upgrade-varnish: glob varnishkafka-all svc (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:14:12] (03CR) 10Ssingh: roll-upgrade-varnish: glob varnishkafka-all svc (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:14:33] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2044.codfw.wmnet} and A:cp - testing 1300236 () [19:15:10] (03CR) 10Ssingh: "Can you also paste the output of VTC tests, for both text and upload? Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:17:53] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5020 - brett@cumin2002" [19:17:59] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5020 - brett@cumin2002" [19:17:59] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:59] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp5020.eqsin.wmnet 24.0.132.10.in-addr.arpa 4.2.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [19:18:03] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp5020.eqsin.wmnet 24.0.132.10.in-addr.arpa 4.2.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [19:18:04] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5020 [19:18:53] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2044.codfw.wmnet} and A:cp - testing 1300236 () [19:18:57] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5020 [19:18:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cp5020 [19:19:33] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2046.codfw.wmnet} and A:cp - testing 1300236 () [19:21:22] (03PS2) 10BCornwall: roll-upgrade-varnish: glob varnishkafka-all svc [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 [19:22:23] (03CR) 10Ssingh: [C:03+1] roll-upgrade-varnish: glob varnishkafka-all svc [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 
10BCornwall) [19:23:18] (03CR) 10Ssingh: [C:03+1] "Per Brett: https://man7.org/linux/man-pages/man1/systemctl.1.html" [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:23:25] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2046.codfw.wmnet} and A:cp - testing 1300236 () [19:24:21] (03CR) 10BCornwall: "`END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2044.codfw.wmnet} and A:cp - testing 1" [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:24:31] (03CR) 10BCornwall: [V:03+2 C:03+2] roll-upgrade-varnish: glob varnishkafka-all svc [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:27:23] (03Merged) 10jenkins-bot: roll-upgrade-varnish: glob varnishkafka-all svc [cookbooks] - 10https://gerrit.wikimedia.org/r/1300236 (owner: 10BCornwall) [19:27:42] !log bblack@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload and not P{cp7008.magru.wmnet} and A:cp - Upgrade wmfuniq to 0.3.0 () [19:30:13] !log bblack@cumin1003 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload and not P{cp7008.magru.wmnet} and A:cp - Upgrade wmfuniq to 0.3.0 () [19:42:08] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12006694 (10MLechvien-WMF) [19:43:19] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12006699 (10MLechvien-WMF) a:03jijiki [19:46:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:46:50] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:51:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:51:50] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:09] <_Gerges> Ping [19:53:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [19:59:14] (03PS2) 10Bking: cirrussearch: create docker-based role for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1300232 (https://phabricator.wikimedia.org/T425585) [19:59:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2000). Please do the needful. [20:00:05] apaskulin and _Gerges: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:17] <_Gerges> Here [20:05:25] (03CR) 10Bking: [C:03+2] cirrussearch: create docker-based role for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1300232 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [20:09:25] <_Gerges> Ping [20:14:05] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#12006862 (10AKanji-WMF) @Pcoombe could you please advise as to whether this is something we should/can resolve in the ne... [20:14:50] <_Gerges> @RoanKattouw, @urbanecm, @TheresNoTime, @kindrobot, and @cjming: ping [20:15:50] Hi sorry for missing the first ping, I can deploy [20:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) (owner: 10GergesShamon) [20:18:57] (03PS9) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:19:44] (03Merged) 10jenkins-bot: [arzwiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300226 (https://phabricator.wikimedia.org/T427720) (owner: 10GergesShamon) [20:20:10] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1300226|[arzwiki] Change the wordmark (T427720)]] [20:20:15] T427720: Revert unintended arzwiki wordmark changes introduced by T374430 - https://phabricator.wikimedia.org/T427720 [20:22:29] !log catrope@deploy1003 gergesshamon, catrope: Backport for 
[[gerrit:1300226|[arzwiki] Change the wordmark (T427720)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:22:33] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:24:18] <_Gerges> I tested a patch, everything is fine :) [20:24:28] <_Gerges> You can continue [20:24:41] (03PS1) 10Bking: deployment-prep: activate new cirrussearch profile [puppet] - 10https://gerrit.wikimedia.org/r/1300242 (https://phabricator.wikimedia.org/T425585) [20:25:37] !log catrope@deploy1003 gergesshamon, catrope: Continuing with deployment [20:29:59] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300226|[arzwiki] Change the wordmark (T427720)]] (duration: 09m 49s) [20:30:05] T427720: Revert unintended arzwiki wordmark changes introduced by T374430 - https://phabricator.wikimedia.org/T427720 [20:30:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300073 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [20:30:52] (03PS10) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:30:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5020.eqsin.wmnet with OS trixie [20:32:00] (03Merged) 10jenkins-bot: wgRestSandboxSpecs: Add Lift Wing API to documentation wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300073 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [20:32:27] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1300073|wgRestSandboxSpecs: 
Add Lift Wing API to documentation wikis (T427902)]] [20:32:32] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [20:33:03] (03CR) 10Dzahn: [C:03+2] gerrit: flip direction of symlink for log directories [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [20:33:19] <_Gerges> There is a command to purge cache that is used after changing the logo. Can you run this command? [20:34:00] (03PS11) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:34:35] !log catrope@deploy1003 catrope, gkyziridis: Backport for [[gerrit:1300073|wgRestSandboxSpecs: Add Lift Wing API to documentation wikis (T427902)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:36:33] _Gerges: which logo URL is this about? [20:37:04] (03PS12) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:37:16] (03CR) 10Bking: [C:03+2] deployment-prep: activate new cirrussearch profile [puppet] - 10https://gerrit.wikimedia.org/r/1300242 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [20:37:30] <_Gerges> Wordmark arzwiki [20:37:33] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:37:48] do you have a URL? 
[20:39:47] (03PS13) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:40:03] !log catrope@deploy1003 catrope, gkyziridis: Continuing with deployment [20:40:24] _Gerges: https://arz.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ar.svg ? [20:40:55] _Gerges: if so.. done. should be updated [20:41:06] (03PS1) 10BPirkle: REST: set new RestModuleOverrides variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) [20:42:35] (03PS14) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:42:39] <_Gerges> https://arz.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-arz.svg [20:43:33] <_Gerges> The use of wikipedia-wordmark-ar.svg has been changed to wikipedia-wordmark-arz.svg in arzwiki [20:44:23] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300073|wgRestSandboxSpecs: Add Lift Wing API to documentation wikis (T427902)]] (duration: 11m 55s) [20:44:28] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [20:44:44] All done [20:45:11] _Gerges: Right, but the content at that URL didn't change today, right? [20:45:25] We're just pointing arzwiki to a different URL, we didn't change the .svg file itself [20:45:35] (03CR) 10Clare Ming: [C:03+2] Deploy GrowthBook 4.4.0 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300173 (https://phabricator.wikimedia.org/T427506) (owner: 10Santiago Faci) [20:45:44] (But maybe it changed previously, and someone forgot to purge the cache then?) 
[20:47:12] <_Gerges> I don't know [20:47:17] (03PS19) 10JHathaway: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [20:47:39] (03Merged) 10jenkins-bot: Deploy GrowthBook 4.4.0 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300173 (https://phabricator.wikimedia.org/T427506) (owner: 10Santiago Faci) [20:47:58] (03CR) 10Dzahn: [C:03+2] "root@gerrit2002:/var/log/gerrit# pwd -P" [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [20:48:22] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5020.* [20:48:29] (03CR) 10Dzahn: [C:03+2] "I moved the existing files from old to new location.. then deleted the dir and let puppet create the link." [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [20:49:30] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS trixie [20:49:46] (03CR) 10JHathaway: redfish: improve add_account with AccountTypes (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [20:50:03] !log brett@cumin2002 START - Cookbook sre.hosts.move-vlan for host cp5024 [20:50:21] RoanKattouw: _Gerges: I purged both URLs, "ar" and "arz". 
not sure if that was helpful now [20:50:22] (03CR) 10CI reject: [V:04-1] redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [20:50:26] (03CR) 10Cwhite: [C:03+1] thumbor: emit structured logs from haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300211 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [20:50:59] (03PS1) 10BCornwall: common: Update cp5024's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1300246 (https://phabricator.wikimedia.org/T428229) [20:51:44] <_Gerges> Thank you both [20:53:06] brett@cumin2002 reimage (PID 1843502) is awaiting input [20:53:26] (03CR) 10CDobbins: [C:03+1] common: Update cp5024's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1300246 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [20:53:27] (03PS1) 10Catrope: Revert "wgRestSandboxSpecs: Add Lift Wing API to documentation wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300247 [20:53:34] (03CR) 10BCornwall: [C:03+2] common: Update cp5024's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1300246 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [20:53:44] (03PS2) 10Catrope: Revert "wgRestSandboxSpecs: Add Lift Wing API to documentation wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300247 (https://phabricator.wikimedia.org/T427902) [20:54:01] (03PS1) 10Jforrester: tests: Fix StandaloneHooksTest ordering, now broken by DB upgrade [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300248 [20:54:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300247 (https://phabricator.wikimedia.org/T427902) (owner: 10Catrope) [20:54:31] !log brett@cumin2002 START - Cookbook sre.dns.netbox [20:54:58] (03Merged) 10jenkins-bot: Revert 
"wgRestSandboxSpecs: Add Lift Wing API to documentation wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300247 (https://phabricator.wikimedia.org/T427902) (owner: 10Catrope) [20:55:24] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1300247|Revert "wgRestSandboxSpecs: Add Lift Wing API to documentation wikis" (T427902)]] [20:55:29] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [20:57:29] !log catrope@deploy1003 catrope: Backport for [[gerrit:1300247|Revert "wgRestSandboxSpecs: Add Lift Wing API to documentation wikis" (T427902)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:57:59] !log catrope@deploy1003 catrope: Continuing with deployment [20:59:55] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5024 - brett@cumin2002" [21:00:00] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5024 - brett@cumin2002" [21:00:01] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:00:01] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp5024.eqsin.wmnet 35.0.132.10.in-addr.arpa 5.3.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2100) [21:00:05] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp5024.eqsin.wmnet 35.0.132.10.in-addr.arpa 5.3.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [21:00:06] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host 
cp5024 [21:02:14] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300247|Revert "wgRestSandboxSpecs: Add Lift Wing API to documentation wikis" (T427902)]] (duration: 06m 51s) [21:02:20] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [21:02:59] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5024 [21:02:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cp5024 [21:03:58] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815 (10BLiviero-WMF) 03NEW [21:05:17] (03PS15) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [21:05:31] (03PS1) 10Jforrester: ExecuteTestAndCacheJob: Fix stdClasses serialised wrongly by JobQueue [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300250 (https://phabricator.wikimedia.org/T428801) [21:05:58] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:06] RoanKattouw: You done with deploys? [21:07:14] Assuming yes. 
[21:07:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300250 (https://phabricator.wikimedia.org/T428801) (owner: 10Jforrester) [21:07:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300248 (owner: 10Jforrester) [21:07:35] (03PS16) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [21:09:34] (03Merged) 10jenkins-bot: tests: Fix StandaloneHooksTest ordering, now broken by DB upgrade [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300248 (owner: 10Jforrester) [21:10:58] RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:11:10] hm [21:12:10] (03PS20) 10JHathaway: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [21:13:07] (03Merged) 10jenkins-bot: ExecuteTestAndCacheJob: Fix stdClasses serialised wrongly by JobQueue [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300250 (https://phabricator.wikimedia.org/T428801) (owner: 10Jforrester) [21:13:35] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1300250|ExecuteTestAndCacheJob: Fix stdClasses serialised wrongly by JobQueue (T428801)]], [[gerrit:1300248|tests: Fix StandaloneHooksTest ordering, now broken by DB upgrade]] [21:13:40] T428801: TypeError: MediaWiki\Extension\WikiLambda\OrchestratorRequest::orchestrateTestExecution(): 
Argument #1 ($testCall) must be of type stdClass, array given, called in /srv/mediawiki/php-1.47.0-wmf.6/extensions/WikiLambda/includes/ - https://phabricator.wikimedia.org/T428801 [21:15:40] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1300250|ExecuteTestAndCacheJob: Fix stdClasses serialised wrongly by JobQueue (T428801)]], [[gerrit:1300248|tests: Fix StandaloneHooksTest ordering, now broken by DB upgrade]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:17:41] !log jforrester@deploy1003 jforrester: Continuing with deployment [21:21:58] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300250|ExecuteTestAndCacheJob: Fix stdClasses serialised wrongly by JobQueue (T428801)]], [[gerrit:1300248|tests: Fix StandaloneHooksTest ordering, now broken by DB upgrade]] (duration: 08m 23s) [21:22:03] T428801: TypeError: MediaWiki\Extension\WikiLambda\OrchestratorRequest::orchestrateTestExecution(): Argument #1 ($testCall) must be of type stdClass, array given, called in /srv/mediawiki/php-1.47.0-wmf.6/extensions/WikiLambda/includes/ - https://phabricator.wikimedia.org/T428801 [21:23:06] (03CR) 10Kamila Součková: rest gateway: implement cost-based rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [21:28:07] James_F: Sorry, yes I was done [21:28:13] Good. 
;-) [21:32:10] (03CR) 10Kamila Součková: "LGTM except Ariel's point" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [21:35:03] jouncebot: now [21:35:03] For the next 0 hour(s) and 24 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2100) [21:35:09] jouncebot: nowandnext [21:35:09] For the next 0 hour(s) and 24 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2100) [21:35:09] In 0 hour(s) and 24 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2200) [21:37:16] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [21:37:39] (03CR) 10Kamila Součková: rest gateway: implement cost-based rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [21:37:49] (03CR) 10Kamila Součková: "LGTM except Ariel's points." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [21:38:52] (03CR) 10Kamila Součková: "LGTM except Ariel's points" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) (owner: 10Daniel Kinzler) [21:39:38] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:44:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [21:47:29] (03CR) 10Dzahn: [C:03+2] releases: remove outdated comments about releases-jenkins in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1299585 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [22:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2200) [22:02:11] jouncebot: now [22:02:11] For the next 0 hour(s) and 57 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260610T2200) [22:06:00] !log gerrit-replica: restarting gerrit [22:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:57] !log gerrit-spare: restarting gerrit [22:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:32] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit2003.wikimedia.org with reason: service restart [22:11:41] !log dzahn@cumin2002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on gerrit.wikimedia.org with reason: service restart [22:13:35] !log gerrit - restarting service for logging change [22:13:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5024.eqsin.wmnet with OS trixie [22:16:46] FIRING: [4x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [22:16:51] FIRING: [2x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [22:17:07] I down timed it as much as possible. [22:17:16] this was a service restart and it's back for me right now. 
[22:21:46] RESOLVED: [4x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [22:21:51] RESOLVED: [2x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [22:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [22:55:03] (03PS3) 10Ladsgroup: wikimedia.org: Introduce thumb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298821 (https://phabricator.wikimedia.org/T427465) [22:55:10] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wikimedia.org: Introduce thumb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298821 (https://phabricator.wikimedia.org/T427465) (owner: 10Ladsgroup) [22:55:30] !log ladsgroup@dns1004 START - running authdns-update [22:56:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:56:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:57:07] !log ladsgroup@dns1004 END - running authdns-update [22:58:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827 (10EChukwukere-WMF) 03NEW [22:59:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:02:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:02:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300154 (https://phabricator.wikimedia.org/T107188) (owner: 
Krinkle) [23:03:35] (Merged) jenkins-bot: Disable ShortUrl on bdwikimedia, bhwiki, bnwiki, bnwikisource, eswikibooks, gomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1300154 (https://phabricator.wikimedia.org/T107188) (owner: Krinkle) [23:04:02] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1300154|Disable ShortUrl on bdwikimedia, bhwiki, bnwiki, bnwikisource, eswikibooks, gomwiki (T107188)]] [23:04:07] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [23:05:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:05:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:06:11] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1300154|Disable ShortUrl on bdwikimedia, bhwiki, bnwiki, bnwikisource, eswikibooks, gomwiki (T107188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:08:38] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:08:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:09:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:11:17] !log krinkle@deploy1003 krinkle: Continuing with deployment [23:14:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:15:40] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300154|Disable ShortUrl on bdwikimedia, bhwiki, bnwiki, bnwikisource, eswikibooks, gomwiki (T107188)]] (duration: 11m 37s) [23:15:45] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [23:19:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:19:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:21:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [23:21:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:23:09] (PS12) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:24:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:25:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:30:02] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1299653 (owner: TrainBranchBot) [23:30:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:35:33] (PS1) Jforrester: wikifunctions: Switch JavaScript evaluator to Rust-based version [deployment-charts] - https://gerrit.wikimedia.org/r/1300271 (https://phabricator.wikimedia.org/T417870) [23:35:35] (PS1) Jforrester: wikifunctions: Drop temporary Rust evaluator releases [deployment-charts] - https://gerrit.wikimedia.org/r/1300272 (https://phabricator.wikimedia.org/T417870) [23:36:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:36:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down 
but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:39:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:40:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1300274 [23:40:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1300274 (owner: TrainBranchBot) [23:41:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:42:38] (PS13) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:44:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:46:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:46:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:49:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:50:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [23:51:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:51:53] (PS14) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:53:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1300274 (owner: TrainBranchBot) [23:53:38] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:53:44] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5024.* [23:54:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:56:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:57:46] (PS15) BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [23:59:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal