[00:02:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1382.eqiad.wmnet with reason: host reimage
[00:05:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1381.eqiad.wmnet with reason: host reimage
[00:09:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1379.eqiad.wmnet with reason: host reimage
[00:12:03] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1380.eqiad.wmnet with reason: host reimage
[00:12:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1375.eqiad.wmnet with reason: host reimage
[00:14:37] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:15:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:15:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1376.eqiad.wmnet with OS trixie
[00:15:15] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873879 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:15:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1248 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:16:09] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1383.eqiad.wmnet with OS trixie
[00:16:18] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873880 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[00:18:56] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:19:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:19:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1382.eqiad.wmnet with OS trixie
[00:19:24] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:19:52] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1384.eqiad.wmnet with OS trixie
[00:20:05] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[00:20:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1248 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:21:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:44] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1380.eqiad.wmnet with OS trixie
[00:21:57] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873883 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:22:16] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:22:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:22:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1381.eqiad.wmnet with OS trixie
[00:22:43] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873885 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:24:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1380.eqiad.wmnet with OS trixie
[00:24:27] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873886 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[00:25:26] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:26:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:26:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1379.eqiad.wmnet with OS trixie
[00:26:19] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873887 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:28:18] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1383.eqiad.wmnet with reason: host reimage
[00:29:37] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:30:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:30:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1375.eqiad.wmnet with OS trixie
[00:30:13] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873889 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:31:38] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1384.eqiad.wmnet with reason: host reimage
[00:33:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1383.eqiad.wmnet with reason: host reimage
[00:36:44] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1380.eqiad.wmnet with reason: host reimage
[00:37:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1384.eqiad.wmnet with reason: host reimage
[00:39:05] (PS2) C. Scott Ananian: Increase Parsoid Read Views to 60% of enwiki mobile web traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/1279453 (https://phabricator.wikimedia.org/T424880)
[00:39:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1279453 (https://phabricator.wikimedia.org/T424880) (owner: C. Scott Ananian)
[00:39:41] (PS2) C. Scott Ananian: Increase Parsoid Read Views to 100% of enwiki mobile web traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880)
[00:39:52] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880) (owner: C.
Scott Ananian)
[00:41:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1380.eqiad.wmnet with reason: host reimage
[00:49:50] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:50:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:50:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1383.eqiad.wmnet with OS trixie
[00:50:17] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:53:47] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:56:52] jclark@cumin1003 reimage (PID 2815818) is awaiting input
[00:57:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:57:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1384.eqiad.wmnet with OS trixie
[00:57:13] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873919 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:57:17] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:58:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:58:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1380.eqiad.wmnet with OS trixie
[00:58:27] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu...
[00:59:07] ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873922 (Jclark-ctr) These have finished wikikube-worker1375 wikikube-worker1376 wikik...
[01:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[01:09:52] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1279521
[01:09:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1279521 (owner: TrainBranchBot)
[01:21:56] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1279521 (owner: TrainBranchBot)
[02:01:42] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:03:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 11h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[02:08:09] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 26s)
[02:09:20] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:42] ops-eqiad, DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1376:9290
- https://phabricator.wikimedia.org/T424917 (phaultfinder) NEW
[02:34:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:24] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:30:16] (CR) Dragoniez: [C:+1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898) (owner: VadymTS1)
[03:36:16] (CR) Dragoniez: "Should we check with the folks on the task before proceeding?
Configuration changes are generally technically trivial, but that the task h" [mediawiki-config] - https://gerrit.wikimedia.org/r/1278382 (https://phabricator.wikimedia.org/T355445) (owner: VadymTS1)
[03:43:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898) (owner: VadymTS1)
[03:44:41] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873981 (Papaul)
[03:46:42] (CR) Dragoniez: [C:-1] mediawikiwiki: Changetags right only for bots and administrators in MediaWiki.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1278382 (https://phabricator.wikimedia.org/T355445) (owner: VadymTS1)
[04:08:19] (PS2) Ryan Kemper: cumin: repurpose wdqs-public, add wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/1278603 (https://phabricator.wikimedia.org/T415073)
[04:08:41] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278603 (https://phabricator.wikimedia.org/T415073) (owner: Ryan Kemper)
[04:21:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:54] (PS1) VadymTS1: Code bugs fixed [mediawiki-config] - https://gerrit.wikimedia.org/r/1279633
[04:35:45] (Abandoned) VadymTS1: mediawikiwiki: Changetags right only for bots and administrators in MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1278382 (https://phabricator.wikimedia.org/T355445) (owner: VadymTS1)
[04:35:48]
SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11873989 (Ladsgroup) >>! In T414805#11873042, @Nux wrote: > > There are still loads of broken `MediaWiki:Common.css`. I'm usually...
[04:36:31] (Abandoned) VadymTS1: Code bugs fixed [mediawiki-config] - https://gerrit.wikimedia.org/r/1279633 (owner: VadymTS1)
[04:39:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: Codename Noreste)
[04:51:22] (CR) VadymTS1: "I'm closed this change because this very old phab ticket, I don't want to take risk and have to make changes after 2 years" [mediawiki-config] - https://gerrit.wikimedia.org/r/1278382 (https://phabricator.wikimedia.org/T355445) (owner: VadymTS1)
[04:52:29] (PS1) Marostegui: db2149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279649 (https://phabricator.wikimedia.org/T424792)
[04:53:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2149.codfw.wmnet with reason: Reimage to Trixie
[04:53:06] (CR) Marostegui: [C:+2] db2149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279649 (https://phabricator.wikimedia.org/T424792) (owner: Marostegui)
[04:53:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2149: Reimage to Trixie
[04:53:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2149: Reimage to Trixie
[04:55:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2149.codfw.wmnet with OS trixie
[04:57:25] (PS1) Marostegui: mariadb: Decommission pc2012 [puppet] -
https://gerrit.wikimedia.org/r/1279650 (https://phabricator.wikimedia.org/T424201)
[04:59:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc2012.codfw.wmnet
[04:59:51] (CR) Marostegui: [C:+2] mariadb: Decommission pc2012 [puppet] - https://gerrit.wikimedia.org/r/1279650 (https://phabricator.wikimedia.org/T424201) (owner: Marostegui)
[05:04:27] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[05:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[05:09:34] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:09:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:09:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:09:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2012.codfw.wmnet
[05:10:38] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission pc2012.codfw.wmnet - https://phabricator.wikimedia.org/T424201#11874046 (Marostegui) a:Marostegui→None
[05:10:45] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission pc2012.codfw.wmnet - https://phabricator.wikimedia.org/T424201#11874050 (Marostegui) This is
ready for DC-Ops
[05:11:57] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission pc2012.codfw.wmnet - https://phabricator.wikimedia.org/T424201#11874052 (Marostegui) a:Jhancock.wm
[05:14:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: host reimage
[05:18:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: host reimage
[05:19:58] (PS1) Marostegui: mariadb: Decommission db2146 [puppet] - https://gerrit.wikimedia.org/r/1279661 (https://phabricator.wikimedia.org/T424189)
[05:20:22] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2146.codfw.wmnet
[05:23:26] (CR) Marostegui: [C:+2] mariadb: Decommission db2146 [puppet] - https://gerrit.wikimedia.org/r/1279661 (https://phabricator.wikimedia.org/T424189) (owner: Marostegui)
[05:27:18] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[05:31:21] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2146.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:33:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2146.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:33:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:33:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2146.codfw.wmnet
[05:34:04] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2146.codfw.wmnet - https://phabricator.wikimedia.org/T424189#11874067 (Marostegui) a:Marostegui→Jhancock.wm
[05:34:11] ops-codfw, DBA, DC-Ops,
decommission-hardware: decommission db2146.codfw.wmnet - https://phabricator.wikimedia.org/T424189#11874072 (Marostegui) This is ready for DC-Ops
[05:34:35] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2146.codfw.wmnet - https://phabricator.wikimedia.org/T424189#11874074 (Marostegui)
[05:35:25] (PS1) Marostegui: instances.yaml: Remove db2147 [puppet] - https://gerrit.wikimedia.org/r/1279674 (https://phabricator.wikimedia.org/T424226)
[05:35:37] (PS1) Marostegui: Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279675
[05:36:02] (CR) Marostegui: [C:+2] instances.yaml: Remove db2147 [puppet] - https://gerrit.wikimedia.org/r/1279674 (https://phabricator.wikimedia.org/T424226) (owner: Marostegui)
[05:36:32] (CR) Marostegui: [C:+2] Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279675 (owner: Marostegui)
[05:37:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2147 from dbctl T424226', diff saved to https://phabricator.wikimedia.org/P92000 and previous config saved to /var/cache/conftool/dbconfig/20260430-053712-marostegui.json
[05:37:18] T424226: decommission db2147.codfw.wmnet - https://phabricator.wikimedia.org/T424226
[05:38:05] (PS1) Marostegui: db2147: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279679 (https://phabricator.wikimedia.org/T424226)
[05:38:48] (CR) Marostegui: [C:+2] db2147: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279679 (https://phabricator.wikimedia.org/T424226) (owner: Marostegui)
[05:40:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2149.codfw.wmnet with OS trixie
[05:47:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2149: after reimage to trixie
[05:47:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC morning backport
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T0600)
[06:00:05] marostegui, Amir1, and federico3: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T0600). Please do the needful.
[06:03:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 7h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[06:32:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2149: after reimage to trixie
[06:32:49] (CR) Daniel Kinzler: [C:-1] "CR-1 to remind myself that I need to understand why the diffs generated by CI look so different from the git diffs."
[deployment-charts] - https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: Daniel Kinzler)
[06:42:24] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:50:24] (PS1) Muehlenhoff: Extend access for sarmbrutser [puppet] - https://gerrit.wikimedia.org/r/1280055
[06:54:06] (CR) Muehlenhoff: [C:+2] Extend access for sarmbrutser [puppet] - https://gerrit.wikimedia.org/r/1280055 (owner: Muehlenhoff)
[07:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T0700).
[07:00:05] phuedx and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:34] (CR) Slyngshede: [C:+1] admin: extend expiry_date for sarmbruster by 1 month [puppet] - https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402) (owner: Dzahn)
[07:02:56] hello
[07:03:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5003.wikimedia.org
[07:03:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:08:53] jmm@cumin2002 makevm (PID 1129140) is awaiting input
[07:15:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002"
[07:15:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002"
[07:15:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:15:48] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5003.wikimedia.org on all recursors
[07:15:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5003.wikimedia.org on all recursors
[07:16:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002"
[07:16:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5003.wikimedia.org - jmm@cumin2002"
[07:16:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[07:17:33] jouncebot: now
[07:17:33] For the next 0 hour(s) and 42 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T0700)
[07:17:34] !log
jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
[07:18:36] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11874217 (MoritzMuehlenhoff)
[07:18:47] (PS1) Brouberol: kafka-jumbo: set inter.broker.protocol to 3.7.0 [puppet] - https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527)
[07:19:38] jmm@cumin2002 makevm (PID 1129140) is awaiting input
[07:19:40] phuedx: is your change deployed?
[07:19:44] (PS2) Brouberol: kafka-jumbo: set inter.broker.protocol to 3.7.0 [puppet] - https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527)
[07:20:30] kart_: Hey. Sorry. I was delayed
[07:21:08] abijeet, kart_: Can you deploy your patch?
[07:21:34] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527) (owner: Brouberol)
[07:21:48] phuedx: I can deploy abijeet's change :)
[07:22:10] Cool. Please do. I'll get ready to deploy mine :)
[07:22:13] Sorry for the delay both
[07:22:35] no problem.
[07:22:53] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[07:24:11] kart_, thanks
[07:25:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[07:27:34] (PS3) Brouberol: kafka-jumbo: set inter.broker.protocol to 3.7 [puppet] - https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527)
[07:29:41] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527) (owner: Brouberol)
[07:31:03] (PS1) Muehlenhoff: d-i: Remove dhcpcd-base after installation completed [puppet] - https://gerrit.wikimedia.org/r/1280082 (https://phabricator.wikimedia.org/T414341)
[07:33:02] (PS1) MVernon: swift: restore 2 nodes to rings, drain 2 more for reimage [puppet] - https://gerrit.wikimedia.org/r/1280083 (https://phabricator.wikimedia.org/T354872)
[07:33:11] (Merged) jenkins-bot: Don't load general modules as style modules [extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[07:33:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[07:35:57] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1279079|Don't load general modules as style modules (T424618)]]
[07:36:01] T424618: Increase in "Unexpected general module "ext.translate.special.XXXX in styles queue" Resourceloader errors - https://phabricator.wikimedia.org/T424618
[07:36:40] (CR) JavierMonton: [C:+1] kafka-jumbo: set inter.broker.protocol to 3.7 [puppet] - https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527)
(owner: 10Brouberol) [07:37:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5003.wikimedia.org with OS bookworm [07:37:57] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1279079|Don't load general modules as style modules (T424618)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:37:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874232 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host hcaptcha-proxy5003.wikimedia.org with OS bookworm [07:38:43] abijeet: available for testing. Let me know. [07:39:12] kart_, ok [07:41:24] kart_, looks good. [07:41:29] cool [07:41:35] !log kartik@deploy1003 kartik, abi: Continuing with deployment [07:45:26] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279079|Don't load general modules as style modules (T424618)]] (duration: 09m 29s) [07:45:31] T424618: Increase in "Unexpected general module "ext.translate.special.XXXX in styles queue" Resourceloader errors - https://phabricator.wikimedia.org/T424618 [07:47:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [07:47:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T419961)', diff saved to https://phabricator.wikimedia.org/P92006 and previous config saved to /var/cache/conftool/dbconfig/20260430-074717-fceratto.json [07:47:22] phuedx: we're done. 
[07:48:00] kart_, thanks [07:48:27] Thanks [07:48:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279476 (owner: 10Phuedx) [07:54:31] (03Merged) 10jenkins-bot: JS SDK: Remove compat deprecation warnings [extensions/TestKitchen] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279476 (owner: 10Phuedx) [07:54:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T419961)', diff saved to https://phabricator.wikimedia.org/P92007 and previous config saved to /var/cache/conftool/dbconfig/20260430-075436-fceratto.json [07:55:01] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1279476|JS SDK: Remove compat deprecation warnings]] [07:56:51] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1279476|JS SDK: Remove compat deprecation warnings]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:57:46] FIRING: [2x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [08:01:16] !log phuedx@deploy1003 phuedx: Continuing with deployment [08:01:37] Checked on a group1 wiki that the deprecation warnings weren't coming through. 
LGTM [08:02:46] RESOLVED: [2x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [08:04:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P92008 and previous config saved to /var/cache/conftool/dbconfig/20260430-080444-fceratto.json [08:05:14] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279476|JS SDK: Remove compat deprecation warnings]] (duration: 10m 13s) [08:08:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:08:56] Cool. I think that's the window over [08:08:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:09:21] !log UTC morning backport window finished [08:09:24] !log installing rsync security updates [08:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:29] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:10:00] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[08:14:05] (03CR) 10Filippo Giunchedi: [C:03+1] d-i: Remove dhcpcd-base after installation completed [puppet] - 10https://gerrit.wikimedia.org/r/1280082 (https://phabricator.wikimedia.org/T414341) (owner: 10Muehlenhoff) [08:14:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P92009 and previous config saved to /var/cache/conftool/dbconfig/20260430-081452-fceratto.json [08:18:32] !log installing nginx security updates [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:46] (03CR) 10Hashar: [C:03+2] wm-checks-api: add tag for PostgreSQL jobs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1266965 (owner: 10Hashar) [08:23:35] (03Merged) 10jenkins-bot: wm-checks-api: add tag for PostgreSQL jobs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1266965 (owner: 10Hashar) [08:23:38] (03CR) 10Filippo Giunchedi: [C:04-1] "My understanding is that there are two issues at play here:" [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [08:25:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T419961)', diff saved to https://phabricator.wikimedia.org/P92010 and previous config saved to /var/cache/conftool/dbconfig/20260430-082501-fceratto.json [08:25:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [08:25:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2170 
(T419961)', diff saved to https://phabricator.wikimedia.org/P92011 and previous config saved to /var/cache/conftool/dbconfig/20260430-082530-fceratto.json [08:25:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5003.wikimedia.org with reason: host reimage [08:27:34] !log hashar@deploy1003 Started deploy [gerrit/gerrit@83b886a]: wm-checks-api: add tag for PostgreSQL jobs [08:27:48] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@83b886a]: wm-checks-api: add tag for PostgreSQL jobs (duration: 00m 14s) [08:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5003.wikimedia.org with reason: host reimage [08:31:05] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11874323 (10Nux) >>! In T414805#11873989, @Ladsgroup wrote: >>>! In T414805#11873042, @Nux wrote: >> >> There are still loads of bro... [08:33:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T419961)', diff saved to https://phabricator.wikimedia.org/P92012 and previous config saved to /var/cache/conftool/dbconfig/20260430-083313-fceratto.json [08:33:18] (03PS1) 10Bartosz Wójtowicz: kserve-inference: allow ingress on queue-proxy port 8013. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280202 (https://phabricator.wikimedia.org/T424049) [08:35:02] (03PS2) 10Bartosz Wójtowicz: kserve-inference: allow ingress on queue-proxy port 8013. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280202 (https://phabricator.wikimedia.org/T424049) [08:42:19] (03CR) 10Dpogorzelski: [C:03+1] kserve-inference: allow ingress on queue-proxy port 8013. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1280202 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [08:42:36] (03CR) 10Bartosz Wójtowicz: [C:03+2] kserve-inference: allow ingress on queue-proxy port 8013. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280202 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [08:43:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P92013 and previous config saved to /var/cache/conftool/dbconfig/20260430-084321-fceratto.json [08:47:33] (03Merged) 10jenkins-bot: kserve-inference: allow ingress on queue-proxy port 8013. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280202 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [08:48:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5003.wikimedia.org with OS bookworm [08:48:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy5003.wikimedia.org [08:48:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host hcaptcha-proxy5003.wikimedia.org with OS bookworm completed:... [08:49:15] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[08:49:41] (03PS1) 10DCausse: cirrus-streaming-updater: bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280206 (https://phabricator.wikimedia.org/T424799) [08:49:47] jouncebot: nowandnext [08:49:47] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [08:49:47] In 1 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1000) [08:50:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5004.wikimedia.org [08:50:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy5004.wikimedia.org [08:50:40] FIRING: ProbeDown: Service etherpad1004:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1004:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:51:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5004.wikimedia.org [08:51:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:52:56] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280206 (https://phabricator.wikimedia.org/T424799) (owner: 10DCausse) [08:53:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P92014 and previous config saved to /var/cache/conftool/dbconfig/20260430-085329-fceratto.json [08:54:15] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:54:25] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[08:54:35] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:54:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5004.wikimedia.org - jmm@cumin2002" [08:55:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5004.wikimedia.org - jmm@cumin2002" [08:55:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:55:05] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5004.wikimedia.org on all recursors [08:55:05] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280206 (https://phabricator.wikimedia.org/T424799) (owner: 10DCausse) [08:55:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5004.wikimedia.org on all recursors [08:55:40] RESOLVED: ProbeDown: Service etherpad1004:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1004:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5004.wikimedia.org - jmm@cumin2002" [08:55:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5004.wikimedia.org - jmm@cumin2002" [08:56:43] !log dcausse@deploy1003 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [08:56:58] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:57:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5004.wikimedia.org with OS bookworm [08:57:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host hcaptcha-proxy5004.wikimedia.org with OS bookworm [09:01:21] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:01:27] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:02:39] (03CR) 10JavierMonton: [C:03+1] alerts: mw-page-html-feature-counts-change-enrich (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun) [09:03:31] (03CR) 10Btullis: [C:03+1] kafka-jumbo: set inter.broker.protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527) (owner: 10Brouberol) [09:03:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T419961)', diff saved to https://phabricator.wikimedia.org/P92015 and previous config saved to /var/cache/conftool/dbconfig/20260430-090337-fceratto.json [09:03:39] (03CR) 10Brouberol: [C:03+2] kafka-jumbo: set inter.broker.protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1280078 (https://phabricator.wikimedia.org/T424527) (owner: 10Brouberol) [09:04:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [09:04:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T419961)', diff saved to https://phabricator.wikimedia.org/P92016 and 
previous config saved to /var/cache/conftool/dbconfig/20260430-090408-fceratto.json [09:04:30] (03PS1) 10Phuedx: mw.testKitchen.getExperiment() -> mw.testKitchen.compat.getExperiment() [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280210 (https://phabricator.wikimedia.org/T419513) [09:06:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280210 (https://phabricator.wikimedia.org/T419513) (owner: 10Phuedx) [09:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [09:11:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T419961)', diff saved to https://phabricator.wikimedia.org/P92017 and previous config saved to /var/cache/conftool/dbconfig/20260430-091147-fceratto.json [09:12:56] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [09:15:20] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=97) rolling restart_daemons on A:kafka-jumbo-eqiad [09:17:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [09:18:11] (03CR) 10Daniel Kinzler: [C:04-1] "I think it's just re-ordering. But it's a bit confusing. Would be good to know how to proeprly test these routes before rolling out." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [09:21:54] (03PS1) 10Sergio Gimeno: loggedOutWarning: instrument browser navigation and tab close [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280226 (https://phabricator.wikimedia.org/T421518) [09:21:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P92018 and previous config saved to /var/cache/conftool/dbconfig/20260430-092154-fceratto.json [09:22:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280226 (https://phabricator.wikimedia.org/T421518) (owner: 10Sergio Gimeno) [09:24:28] !log temporarily remove ganeti4006 from the ulsfo02 Ganeti cluster in preparation of forthcoming switch maintenance in ulsfo T424686 [09:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:32] T424686: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686 [09:26:43] PROBLEM - ganeti-noded running on ganeti4006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:27:03] PROBLEM - ganeti-confd running on ganeti4006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:27:21] !log failover Ganeti master in ulsfo02 to ganeti4005 in preparation of forthcoming switch maintenance in ulsfo T424686 [09:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:28:43] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:32:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P92019 and previous config saved to /var/cache/conftool/dbconfig/20260430-093202-fceratto.json [09:35:35] (03CR) 10Tiziano Fogli: logstash: add thanos-query-frontend filter (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [09:35:51] (03CR) 10Jcrespo: [C:03+1] "The change is ok, but I don't recognize that syntax for regexes on the description." [puppet] - 10https://gerrit.wikimedia.org/r/1280083 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:42:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T419961)', diff saved to https://phabricator.wikimedia.org/P92020 and previous config saved to /var/cache/conftool/dbconfig/20260430-094210-fceratto.json [09:42:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [09:42:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T419961)', diff saved to https://phabricator.wikimedia.org/P92021 and previous config saved to /var/cache/conftool/dbconfig/20260430-094239-fceratto.json [09:43:14] (03PS5) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [09:43:14] (03PS8) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 
10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [09:43:48] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11874459 (10ABran-WMF) 05Open→03Resolved Now that {T286066} is done, and the MX record has been updated: ` ~ $ dig MX lists.wikimedia.org +short 10 lists10... [09:47:23] (03CR) 10Atsuko: [C:03+2] dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [09:50:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T419961)', diff saved to https://phabricator.wikimedia.org/P92022 and previous config saved to /var/cache/conftool/dbconfig/20260430-095000-fceratto.json [09:50:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5004.wikimedia.org with reason: host reimage [09:50:33] (03PS3) 10Atsuko: dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) [09:50:40] (03CR) 10Filippo Giunchedi: [C:04-1] "Something else that occurred to me re: 2, we could switch to match zookeeper_clusters on hostname rather than fqdn (or try both), though t" [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [09:54:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5004.wikimedia.org with reason: host reimage [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1000) [10:00:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P92023 and previous config saved to 
/var/cache/conftool/dbconfig/20260430-100009-fceratto.json [10:01:48] (03PS2) 10MVernon: swift: restore 2 nodes to rings, drain 2 more for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1280083 (https://phabricator.wikimedia.org/T354872) [10:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 4.442% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:02:33] (03CR) 10MVernon: [C:03+2] swift: restore 2 nodes to rings, drain 2 more for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1280083 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [10:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 3h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [10:05:57] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11874500 (10MatthewVernon) [10:06:48] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11874504 (10MatthewVernon) [10:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 9.435% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:10:18] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P92024 and previous config saved to /var/cache/conftool/dbconfig/20260430-101017-fceratto.json [10:14:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5004.wikimedia.org with OS bookworm [10:14:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy5004.wikimedia.org [10:14:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874553 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host hcaptcha-proxy5004.wikimedia.org with OS bookworm completed:... [10:16:59] (03CR) 10Atsuko: [C:03+2] "re-trigger" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [10:20:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T419961)', diff saved to https://phabricator.wikimedia.org/P92025 and previous config saved to /var/cache/conftool/dbconfig/20260430-102026-fceratto.json [10:20:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [10:20:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T419961)', diff saved to https://phabricator.wikimedia.org/P92026 and previous config saved to /var/cache/conftool/dbconfig/20260430-102055-fceratto.json [10:24:52] (03Merged) 10jenkins-bot: dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [10:28:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:28:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T419961)', diff saved to https://phabricator.wikimedia.org/P92027 and previous config saved to /var/cache/conftool/dbconfig/20260430-102830-fceratto.json [10:35:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [10:36:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [10:36:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast5005.wikimedia.org [10:36:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:38:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P92028 and previous config saved to /var/cache/conftool/dbconfig/20260430-103838-fceratto.json [10:39:28] !log Clearing stuck Test Kitchen experiment configs value from codfw local cluster cache [10:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5005.wikimedia.org - jmm@cumin2002" [10:40:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5005.wikimedia.org - jmm@cumin2002" [10:40:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:44] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast5005.wikimedia.org on all recursors [10:40:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast5005.wikimedia.org on all recursors [10:41:18] 
(03PS1) 10Marostegui: db2205: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1280280 [10:41:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5005.wikimedia.org - jmm@cumin2002" [10:41:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5005.wikimedia.org - jmm@cumin2002" [10:41:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast5005.wikimedia.org with OS trixie [10:41:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [10:42:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [10:42:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast5005.wikimedia.org with OS trixie [10:42:24] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:46:20] (03CR) 10Marostegui: [C:03+2] db2205: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1280280 (owner: 10Marostegui) [10:46:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2205.codfw.wmnet with reason: Reimage to Trixie [10:47:01] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2205: Reimage to Trixie 
[10:47:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2205: Reimage to Trixie [10:48:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P92030 and previous config saved to /var/cache/conftool/dbconfig/20260430-104846-fceratto.json [10:48:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2205.codfw.wmnet with OS trixie [10:49:37] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1273792 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:52:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:53:47] (03PS3) 10Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access [puppet] - 10https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) [10:58:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T419961)', diff saved to https://phabricator.wikimedia.org/P92031 and previous config saved to /var/cache/conftool/dbconfig/20260430-105854-fceratto.json [10:59:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [10:59:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T419961)', diff saved to https://phabricator.wikimedia.org/P92032 and previous config saved to /var/cache/conftool/dbconfig/20260430-105924-fceratto.json [10:59:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[11:00:29] (03CR) 10Muehlenhoff: http-sso-django-login: Switch to firewall::service and restrict access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:00:51] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:01:17] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [11:02:10] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:06:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T419961)', diff saved to https://phabricator.wikimedia.org/P92033 and previous config saved to /var/cache/conftool/dbconfig/20260430-110640-fceratto.json [11:06:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2205.codfw.wmnet with reason: host reimage [11:10:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: host reimage [11:13:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892#11874684 (10MatthewVernon) a:05MatthewVernon→03None No changes needed for this - modules/install_server/files/autoinstall/scripts/partman_early_comman... 
[11:15:42] (03PS1) 10JMeybohm: Update rsyslog image to trixie and rsyslog 8.2504.0-1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1280313 (https://phabricator.wikimedia.org/T418200) [11:16:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P92034 and previous config saved to /var/cache/conftool/dbconfig/20260430-111648-fceratto.json [11:18:23] (03PS1) 10JMeybohm: Bump default rsyslog container version to 8.2504.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/1280317 (https://phabricator.wikimedia.org/T418200) [11:18:26] (03CR) 10Elukey: [C:03+2] sre.hosts: fix ipmi() calls after spicerack 12.5.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [11:18:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11874719 (10MoritzMuehlenhoff) [11:19:22] !log upgrade spicerack on cumin hosts to 12.5.0 [11:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:52] !log installing policykit-1 security updates [11:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:56] (03PS1) 10JMeybohm: Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) [11:26:19] (03PS1) 10Marostegui: Revert "db2205: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1280333 [11:26:37] (03PS2) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) [11:26:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P92035 and previous config saved to 
/var/cache/conftool/dbconfig/20260430-112656-fceratto.json [11:27:07] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1039.eqiad.wmnet [11:27:09] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1039.eqiad.wmnet [11:27:11] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11874759 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1003 pool for host wikikube-w... [11:27:15] (03CR) 10Marostegui: [C:03+2] Revert "db2205: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1280333 (owner: 10Marostegui) [11:27:31] !log jayme@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker1039.eqiad.wmnet [11:27:31] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker1039.eqiad.wmnet [11:27:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:28:52] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:31:29] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1376:9290 - https://phabricator.wikimedia.org/T424917#11874781 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:33:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2205.codfw.wmnet with OS trixie [11:34:44] (03CR) 10Blake: [C:03+1] Bump default rsyslog container version to 8.2504.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/1280317 (https://phabricator.wikimedia.org/T418200) (owner: 10JMeybohm) [11:35:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host 
wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:35:35] (03CR) 10Blake: "It doesn't look like there's anything in this CR explicitly updating the version to 8.2504.0-1, is that happening implicitly somehow?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1280313 (https://phabricator.wikimedia.org/T418200) (owner: 10JMeybohm) [11:35:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11874800 (10VRiley-WMF) [11:35:58] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:36:08] (03CR) 10Blake: [C:03+1] Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) (owner: 10JMeybohm) [11:36:14] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:37:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T419961)', diff saved to https://phabricator.wikimedia.org/P92036 and previous config saved to /var/cache/conftool/dbconfig/20260430-113704-fceratto.json [11:38:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:39:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [11:39:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2205: after reimage to trixie [11:39:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P92038 and previous config saved to /var/cache/conftool/dbconfig/20260430-113910-fceratto.json [11:39:16] 
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:39:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [11:39:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T419961)', diff saved to https://phabricator.wikimedia.org/P92039 and previous config saved to /var/cache/conftool/dbconfig/20260430-113948-fceratto.json [11:40:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P92040 and previous config saved to /var/cache/conftool/dbconfig/20260430-114020-fceratto.json [11:40:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast5005.wikimedia.org with reason: host reimage [11:42:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:44:50] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1377.eqiad.wmnet with OS trixie [11:45:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast5005.wikimedia.org with reason: host reimage [11:45:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11874825 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... 
[11:45:11] (03PS1) 10Muehlenhoff: Assign the hcaptcha::proxy role to hcaptcha-proxy5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1280353 (https://phabricator.wikimedia.org/T421863) [11:45:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874840 (10MoritzMuehlenhoff) [11:46:03] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:46:26] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:47:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T419961)', diff saved to https://phabricator.wikimedia.org/P92041 and previous config saved to /var/cache/conftool/dbconfig/20260430-114703-fceratto.json [11:47:11] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@fb711fc] (releasing): Update backup releases Jenkins [11:47:34] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@fb711fc] (releasing): Update backup releases Jenkins (duration: 00m 33s) [11:49:33] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@fb711fc] (releasing): Update production releases Jenkins [11:50:22] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@fb711fc] (releasing): Update production releases Jenkins (duration: 01m 04s) [11:50:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P92042 and previous config saved to /var/cache/conftool/dbconfig/20260430-115028-fceratto.json [11:56:43] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1377.eqiad.wmnet with reason: host reimage [11:57:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to 
https://phabricator.wikimedia.org/P92044 and previous config saved to /var/cache/conftool/dbconfig/20260430-115712-fceratto.json [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1200) [12:00:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P92045 and previous config saved to /var/cache/conftool/dbconfig/20260430-120036-fceratto.json [12:00:44] (03CR) 10Cathal Mooney: Add BGP peering from asw1-23 to core routers and mr1 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [12:00:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1377.eqiad.wmnet with reason: host reimage [12:02:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11874964 (10MoritzMuehlenhoff) [12:03:00] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [12:04:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:04:53] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1378.eqiad.wmnet with OS trixie [12:05:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11874988 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... 
[12:05:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast5005.wikimedia.org with OS trixie [12:05:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast5005.wikimedia.org [12:05:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11874989 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast5005.wikimedia.org with OS trixie completed: - bast5005... [12:07:02] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11874991 (10neriah) >>! In T414805#11874323, @Nux wrote: > And that is on top of WMF staff already making interface edits harder by f... [12:07:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P92046 and previous config saved to /var/cache/conftool/dbconfig/20260430-120720-fceratto.json [12:07:21] (03PS1) 10Urbanecm: ReassignMentees: Add logging information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280368 (https://phabricator.wikimedia.org/T418194) [12:08:26] (03PS1) 10Urbanecm: ReassignMentees: Add logging information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1280370 (https://phabricator.wikimedia.org/T418194) [12:08:39] cmooney@cumin1003 netbox (PID 2991460) is awaiting input [12:08:43] jouncebot: nowandnext [12:08:43] For the next 0 hour(s) and 51 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1200) [12:08:43] In 0 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1300) [12:08:51] (03CR) 10Urbanecm: 
[C:03+2] ReassignMentees: Add logging information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280368 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:08:58] (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Add logging information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1280370 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:09:38] (03PS1) 10MVernon: swift: prep for ms-be11* [puppet] - 10https://gerrit.wikimedia.org/r/1280373 (https://phabricator.wikimedia.org/T424895) [12:09:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11875018 (10MoritzMuehlenhoff) [12:10:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P92048 and previous config saved to /var/cache/conftool/dbconfig/20260430-121044-fceratto.json [12:10:50] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:11:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:11:10] !log installing gdk-pixbuf security updates [12:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:11:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895#11875047 (10MatthewVernon) [12:11:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T419635)', diff saved to 
https://phabricator.wikimedia.org/P92049 and previous config saved to /var/cache/conftool/dbconfig/20260430-121130-fceratto.json [12:11:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280368 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:11:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1280370 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:12:35] (03CR) 10AikoChou: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:13:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P92050 and previous config saved to /var/cache/conftool/dbconfig/20260430-121340-fceratto.json [12:16:04] (03PS1) 10Muehlenhoff: Add durum5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1280375 (https://phabricator.wikimedia.org/T421863) [12:16:20] (03CR) 10AikoChou: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:17:15] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:17:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T419961)', diff saved to https://phabricator.wikimedia.org/P92051 and previous config saved to /var/cache/conftool/dbconfig/20260430-121728-fceratto.json [12:17:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [12:17:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T419961)', diff saved to https://phabricator.wikimedia.org/P92052 and previous config saved to /var/cache/conftool/dbconfig/20260430-121758-fceratto.json [12:18:40] (03Merged) 10jenkins-bot: ReassignMentees: Add logging information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280368 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:18:46] (03Merged) 10jenkins-bot: ReassignMentees: Add logging information [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1280370 (https://phabricator.wikimedia.org/T418194) (owner: 10Urbanecm) [12:19:15] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1280368|ReassignMentees: Add logging information (T418194)]], [[gerrit:1280370|ReassignMentees: Add logging information (T418194)]] [12:19:19] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [12:19:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir4003.ulsfo.wmnet to plain [12:20:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of 
ncredir4003.ulsfo.wmnet to plain [12:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 8.023% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:20:21] jclark@cumin1003 reimage (PID 2977448) is awaiting input [12:21:06] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1280368|ReassignMentees: Add logging information (T418194)]], [[gerrit:1280370|ReassignMentees: Add logging information (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:21:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir4004.ulsfo.wmnet to plain [12:22:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:23:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir4004.ulsfo.wmnet to plain [12:23:45] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [12:23:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P92053 and 
previous config saved to /var/cache/conftool/dbconfig/20260430-122348-fceratto.json [12:24:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2205: after reimage to trixie [12:25:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T419961)', diff saved to https://phabricator.wikimedia.org/P92055 and previous config saved to /var/cache/conftool/dbconfig/20260430-122516-fceratto.json [12:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.442s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:26:16] something's wrong with scap... [12:28:12] jhathaway: elukey: https://spiderpig.wikimedia.org/jobs/1862 says something about a missing values.yaml file, but i can't get the page with logs loaded... [12:28:20] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11875160 (10A_smart_kitten) >>! In T414805#11873989, @Ladsgroup wrote: >>>! In T414805#11873042, @Nux wrote: >> >> There are still l... 
[12:28:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus4003.ulsfo.wmnet to plain [12:28:31] (03CR) 10ArielGlenn: rest gateway: rate limits for liftwing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [12:28:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus4003.ulsfo.wmnet to plain [12:29:47] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct typo with reverse for mr1-ulsfo address - cmooney@cumin1003" [12:31:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:32:17] (03CR) 10Fabfur: [C:03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/1279402 (https://phabricator.wikimedia.org/T424785) (owner: 10CDobbins) [12:32:26] urbanecm: o/ [12:32:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow4003.ulsfo.wmnet to plain [12:32:53] cmooney@cumin1003 netbox (PID 2991460) is awaiting input [12:32:55] elukey: scap failed due to $reasons, and https://spiderpig.wikimedia.org/jobs/1862#log refuses to load :/ [12:33:50] yeah same for me, my browser crashes [12:33:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P92056 and previous config saved to /var/cache/conftool/dbconfig/20260430-123356-fceratto.json [12:33:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct typo with reverse for mr1-ulsfo address - cmooney@cumin1003" [12:33:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:34:15] (03CR) 10Jcrespo: [C:03+1] swift: prep for ms-be11* [puppet] - 10https://gerrit.wikimedia.org/r/1280373 (https://phabricator.wikimedia.org/T424895) (owner: 10MVernon) [12:34:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow4003.ulsfo.wmnet to plain [12:34:52] not cool. 
on the mainpage, i can tell it to either retry or abort, but w/o seeing the logs, i have no clue what makes more sense :/ [12:35:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P92057 and previous config saved to /var/cache/conftool/dbconfig/20260430-123524-fceratto.json [12:35:47] (03PS3) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) [12:35:53] urbanecm: those logs should be somewhere, lemme check [12:36:07] i see something in https://logstash.wikimedia.org/goto/05b61171fa7033fc1a1fd7bcbe139de4 [12:36:13] trying to get to the actual error [12:36:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.055s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:37:09] Deployment of mw-cron-main-eqiad failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. 
[12:37:56] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:37:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1377.eqiad.wmnet with OS trixie [12:38:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11875244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu... [12:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 2.755% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:40:23] urbanecm: I am a bit ignorant about spiderpig, maybe we could ping somebody from releng? [12:40:26] oh noes [12:40:36] is that your deployment? 
[12:40:49] depending on where it left things [12:41:09] (03CR) 10MVernon: [C:03+2] swift: prep for ms-be11* [puppet] - 10https://gerrit.wikimedia.org/r/1280373 (https://phabricator.wikimedia.org/T424895) (owner: 10MVernon) [12:41:10] (03PS1) 10STran: Fix incorrect source in back instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280386 (https://phabricator.wikimedia.org/T424075) [12:41:15] it could be, as it failed somewhere in the middle [12:41:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280386 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [12:41:39] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy the latest version of rr-multilingual model server on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:41:49] mmm not sure, the graph looks bad [12:42:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895#11875274 (10MatthewVernon) a:05MatthewVernon→03None [12:42:42] it also predates my deployment by a few mins [12:42:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.517s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:43:42] (03Merged) 10jenkins-bot: ml-services: Deploy the latest version of rr-multilingual model server on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:44:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419635)', diff saved to https://phabricator.wikimedia.org/P92058 and previous config saved to /var/cache/conftool/dbconfig/20260430-124405-fceratto.json [12:44:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:44:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [12:44:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P92059 and previous config saved to /var/cache/conftool/dbconfig/20260430-124429-fceratto.json [12:45:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P92060 and previous config saved to /var/cache/conftool/dbconfig/20260430-124532-fceratto.json [12:45:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P92061 and previous config saved to /var/cache/conftool/dbconfig/20260430-124539-fceratto.json [12:46:01] (03CR) 10Bearloga: EventStreamConfig: remove ABST contextual attribute (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: 10Bearloga) [12:47:14] (03PS2) 10Bearloga: EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) [12:47:47] (03CR) 10Bearloga: EventStreamConfig: remove ABST contextual attribute (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: 
10Bearloga) [12:48:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11875306 (10Jclark-ctr) Both drives have been Swapped [12:49:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy4003.wikimedia.org to plain [12:50:14] (03CR) 10Phuedx: [C:03+1] EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: 10Bearloga) [12:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 2.755% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:50:32] (03CR) 10Muehlenhoff: [C:03+2] Add durum5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1280375 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [12:50:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy4003.wikimedia.org to plain [12:50:46] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:50:56] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:50:59] (03CR) 10Cathal Mooney: Add BGP peering from asw1-23 to core routers and mr1 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [12:51:03] (03CR) 10Cathal Mooney: [C:03+1] Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [12:51:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:52:46] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 801.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:53:21] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:54:21] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:55:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T419961)', diff saved to https://phabricator.wikimedia.org/P92062 and previous config saved to /var/cache/conftool/dbconfig/20260430-125540-fceratto.json [12:55:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to 
https://phabricator.wikimedia.org/P92063 and previous config saved to /var/cache/conftool/dbconfig/20260430-125547-fceratto.json [12:59:26] (03CR) 10Dpogorzelski: [C:03+2] changeprop: Configure RevertRisk multilingual model on changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:59:59] !log dpogorzelski@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1300) [13:00:05] cscott, phuedx, Sergi0, and Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:20] o/ [13:00:37] !log dpogorzelski@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [13:00:38] !log temporarily remove ganeti4008 from the ulsfo02 Ganeti cluster in preparation of forthcoming switch maintenance in ulsfo T424686 [13:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:44] T424686: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686 [13:00:52] 06SRE: Please add Google Search Console domain verification for wikimediafoundation.org - https://phabricator.wikimedia.org/T424976 (10SCherukuwada) 03NEW [13:00:57] o/ [13:01:00] !log dpogorzelski@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync [13:01:25] !log dpogorzelski@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [13:01:35] phuedx's and my patches go together so I'll be deploying in his stead. [13:01:35] i can spiderpig my patch, and as it's a config patch it should be pretty fast. 
[13:01:46] looks like the rest of you have "real" backports [13:02:31] cscott: spiderpig is currently stalled on urbanecm’s job (good luck, dude!) [13:02:55] PROBLEM - ganeti-confd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:03:15] oh, are we in line behind urbanecm ? [13:03:27] PROBLEM - ganeti-noded running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:03:35] It would appear so, yeah [13:04:07] (03PS3) 10STran: Add exposure for experiment instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280387 (https://phabricator.wikimedia.org/T424075) [13:04:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280387 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:05:33] cscott: there is an incident as well [13:05:49] oh, fun. 
[13:05:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P92064 and previous config saved to /var/cache/conftool/dbconfig/20260430-130556-fceratto.json [13:06:00] bearloga: and that job is blocked on 'something wrong in logs and logs are inaccessible so i have no idea where to go' [13:06:24] yeah i found T424975 (Certain deployment logs cause Spiderpig to crash the browser): https://phabricator.wikimedia.org/T424975#11875286 [13:06:24] T424975: Certain deployment logs cause Spiderpig to crash the browser - https://phabricator.wikimedia.org/T424975 [13:06:36] Yep [13:06:37] urbanecm: I tried opening the logs and saw some of them but then the tab went unresponsive [13:06:53] anyway, i'll be here if/when things get rolling again, and if not I guess i can cross my fingers for the late backport window [13:07:01] Sending positive thoughts your way [13:07:03] bearloga: exactly my problem [13:07:32] 06SRE, 06Infrastructure-Foundations: Review the most critical/popular Kafka clients before the Kafka upgrade - https://phabricator.wikimedia.org/T417031#11875360 (10brouberol) This has been handled out of band, and is no longer necessary to keep open now that we're performing the upgrade (or have done so, depe... [13:07:37] 06SRE, 06Infrastructure-Foundations: Review the most critical/popular Kafka clients before the Kafka upgrade - https://phabricator.wikimedia.org/T417031#11875362 (10brouberol) 05Open→03Resolved a:03brouberol [13:07:42] 06SRE, 06Infrastructure-Foundations: Add some kafka clients to the Kafka test cluster - https://phabricator.wikimedia.org/T417034#11875365 (10brouberol) 05Open→03Resolved a:03brouberol This has been handled out of band, and is no longer necessary to keep open now that we're performing the upgrade (or... 
[13:07:53] 06SRE, 06Infrastructure-Foundations: Add some kafka clients to the Kafka test cluster - https://phabricator.wikimedia.org/T417034#11875369 (10brouberol) a:05brouberol→03None [13:08:00] 06SRE, 06Infrastructure-Foundations: Review the most critical/popular Kafka clients before the Kafka upgrade - https://phabricator.wikimedia.org/T417031#11875370 (10brouberol) a:05brouberol→03None [13:08:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [13:09:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:12:25] (03Abandoned) 10ZhaoFJx: arbcom_zhwiki: Add electionadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248954 (https://phabricator.wikimedia.org/T419309) (owner: 10ZhaoFJx) [13:12:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:12:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T419961)', diff saved to https://phabricator.wikimedia.org/P92065 and previous config saved to 
/var/cache/conftool/dbconfig/20260430-131249-fceratto.json [13:16:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419635)', diff saved to https://phabricator.wikimedia.org/P92066 and previous config saved to /var/cache/conftool/dbconfig/20260430-131604-fceratto.json [13:16:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:16:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:16:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P92067 and previous config saved to /var/cache/conftool/dbconfig/20260430-131629-fceratto.json [13:17:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P92068 and previous config saved to /var/cache/conftool/dbconfig/20260430-131739-fceratto.json [13:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:21:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T419961)', diff saved to https://phabricator.wikimedia.org/P92069 and previous config saved to /var/cache/conftool/dbconfig/20260430-132114-fceratto.json [13:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.76% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:24:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898) (owner: 10VadymTS1) [13:25:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:27:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P92070 and previous config saved to /var/cache/conftool/dbconfig/20260430-132747-fceratto.json [13:28:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247186 (https://phabricator.wikimedia.org/T418815) (owner: 10MGChecker) [13:31:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P92071 and previous config saved to 
/var/cache/conftool/dbconfig/20260430-133122-fceratto.json [13:32:49] (03CR) 10Muehlenhoff: [C:03+1] "The patch looks good and this seems fine for the initial deployment. qlever seems like a rather straightforward C++ application with sensi" [puppet] - 10https://gerrit.wikimedia.org/r/1278479 (https://phabricator.wikimedia.org/T424340) (owner: 10Btullis) [13:34:35] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:35:08] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:35:22] (03PS1) 10Zabe: Add script to fix fr_deleted drifts [extensions/WikimediaMaintenance] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280417 (https://phabricator.wikimedia.org/T424553) [13:37:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P92072 and previous config saved to /var/cache/conftool/dbconfig/20260430-133756-fceratto.json [13:39:27] (03PS1) 10Zabe: Start reading from new file tables on testwiki (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280418 (https://phabricator.wikimedia.org/T416548) [13:41:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P92073 and previous config saved to /var/cache/conftool/dbconfig/20260430-134130-fceratto.json [13:44:07] (03CR) 10Zabe: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280418 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [13:45:33] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:45:49] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:48:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419635)', diff saved to https://phabricator.wikimedia.org/P92074 and 
previous config saved to /var/cache/conftool/dbconfig/20260430-134804-fceratto.json [13:48:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:48:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [13:48:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P92075 and previous config saved to /var/cache/conftool/dbconfig/20260430-134829-fceratto.json [13:50:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11875571 (10MoritzMuehlenhoff) [13:50:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P92076 and previous config saved to /var/cache/conftool/dbconfig/20260430-135040-fceratto.json [13:51:17] (03CR) 10Btullis: [C:03+2] Add packages.qlever.org to reprepro as thirdparty/qlever [puppet] - 10https://gerrit.wikimedia.org/r/1278479 (https://phabricator.wikimedia.org/T424340) (owner: 10Btullis) [13:51:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T419961)', diff saved to https://phabricator.wikimedia.org/P92077 and previous config saved to /var/cache/conftool/dbconfig/20260430-135139-fceratto.json [13:52:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [13:52:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T419961)', diff saved to https://phabricator.wikimedia.org/P92078 and previous config saved to /var/cache/conftool/dbconfig/20260430-135207-fceratto.json [13:54:01] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2005.codfw.wmnet with OS trixie [13:54:29] !log herron@cumin1003 
START - Cookbook sre.hosts.move-vlan for host kafka-logging2005 [13:57:32] herron@cumin1003 reimage (PID 3049150) is awaiting input [14:00:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T419961)', diff saved to https://phabricator.wikimedia.org/P92079 and previous config saved to /var/cache/conftool/dbconfig/20260430-140030-fceratto.json [14:00:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P92080 and previous config saved to /var/cache/conftool/dbconfig/20260430-140048-fceratto.json [14:02:34] (03PS1) 10Elukey: Add Wikifunctions' evaluator ingress endpoints to service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1280433 (https://phabricator.wikimedia.org/T424193) [14:02:36] (03PS1) 10Elukey: Turn Wikifunctions evaluator endpoints to production state [puppet] - 10https://gerrit.wikimedia.org/r/1280434 (https://phabricator.wikimedia.org/T424193) [14:02:39] (03PS1) 10Elukey: profile::services_proxy::envoy: add wikifunctions eval endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1280435 (https://phabricator.wikimedia.org/T424193) [14:03:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11875647 (10VRiley-WMF) [14:04:33] (03PS3) 10Herron: kafka-logging2005: update IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/1280431 (https://phabricator.wikimedia.org/T421712) [14:05:39] !log herron@cumin1003 START - Cookbook sre.dns.netbox [14:06:42] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280441 [14:08:30] (03PS1) 
10Gkyziridis: ml-services: Deploy hotfix revertrisk-multilingual on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280442 [14:09:20] (03CR) 10Trueg: "Our Gitlab pipeline (https://gitlab.wikimedia.org/repos/wikidata-platform/wdqs/wdqs-qlever/-/blob/main/.gitlab-ci.yml) already contains a " [puppet] - 10https://gerrit.wikimedia.org/r/1278479 (https://phabricator.wikimedia.org/T424340) (owner: 10Btullis) [14:10:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P92081 and previous config saved to /var/cache/conftool/dbconfig/20260430-141038-fceratto.json [14:10:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P92082 and previous config saved to /var/cache/conftool/dbconfig/20260430-141057-fceratto.json [14:11:03] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging2005 - herron@cumin1003" [14:11:09] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging2005 - herron@cumin1003" [14:11:09] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:11:09] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache kafka-logging2005.codfw.wmnet 85.48.192.10.in-addr.arpa 5.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:11:13] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-logging2005.codfw.wmnet 85.48.192.10.in-addr.arpa 5.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:11:14] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging2005 [14:12:21] 
!log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging2005 [14:12:21] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging2005 [14:12:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236361 (https://phabricator.wikimedia.org/T416174) (owner: 10Seawolf35gerrit) [14:12:40] 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11875684 (10phaultfinder) [14:18:06] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11875703 (10daniel) >>! In T414805#11875160, @A_smart_kitten wrote: > One potential difference that comes to mind is that -- potentia... 
[14:20:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P92083 and previous config saved to /var/cache/conftool/dbconfig/20260430-142046-fceratto.json [14:21:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419635)', diff saved to https://phabricator.wikimedia.org/P92084 and previous config saved to /var/cache/conftool/dbconfig/20260430-142105-fceratto.json [14:21:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:21:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [14:21:51] (03CR) 10AKhatun: [C:03+2] alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun) [14:22:42] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1378.eqiad.wmnet with OS trixie [14:22:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11875740 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... 
[14:23:33] (03Merged) 10jenkins-bot: alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun) [14:23:59] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:24:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:24:04] I'm investigating the SpiderPig problem [14:24:54] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:25:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:25:44] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11875752 (10Gehel) We also have some general documentation of availability expectation... 
[14:25:49] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:26:08] (03PS1) 10VadymTS1: [eswiktionary] Switch $wgSignatureValidation to 'disallow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280449 (https://phabricator.wikimedia.org/T424983) [14:26:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [14:26:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T419635)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260430-142639-fceratto.json [14:26:53] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:27:18] dancy: if you could ping me when things are back to normal, i'd like to stage two patches with a couple hours gap between them for cache mitigation reasons, so i'd really like to get a patch deployed "a couple of hours before" the late backport window today [14:27:29] cscott: Will do. [14:27:29] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11875764 (10Jclark-ctr) Talked to @Jhancock.wm same issues with imaging eqiad servers T418916 [14:27:33] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2005.codfw.wmnet with reason: host reimage [14:28:05] dancy: thanks! good luck on your expedition to the bug caves [14:28:26] (or is it spiderpig sty?) [14:28:47] do spiderpigs live in a sty, like pigs, or a web, like spiders? 
[14:28:48] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11875768 (10MLechvien-WMF) Thanks all for the work on this! @kostajh as you were origin... [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1430) [14:30:46] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280368|ReassignMentees: Add logging information (T418194)]], [[gerrit:1280370|ReassignMentees: Add logging information (T418194)]] (duration: 131m 31s) [14:30:52] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [14:30:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T419961)', diff saved to https://phabricator.wikimedia.org/P92086 and previous config saved to /var/cache/conftool/dbconfig/20260430-143054-fceratto.json [14:31:14] !log dancy@deploy1003 Installing scap version "4.252.0" for 2 host(s) [14:31:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [14:31:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:31:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T419961)', diff saved to https://phabricator.wikimedia.org/P92087 and previous config saved to /var/cache/conftool/dbconfig/20260430-143143-fceratto.json [14:31:51] PROBLEM - Host kafka-logging2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:26] ^^ that's me reimaging [14:33:00] !log dancy@deploy1003 Installation of scap version "4.252.0" completed for 2 hosts [14:33:04] (03CR) 
10Muehlenhoff: [C:03+1] "Sounds good, the base libraries of gnutls are already universally installed anyway." [puppet] - 10https://gerrit.wikimedia.org/r/1279491 (https://phabricator.wikimedia.org/T424672) (owner: 10Bking) [14:33:31] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2005.codfw.wmnet with reason: host reimage [14:34:33] !log dancy@deploy1003 Installing scap version "4.255.0" for 2 host(s) [14:34:36] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1378.eqiad.wmnet with reason: host reimage [14:35:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280449 (https://phabricator.wikimedia.org/T424983) (owner: 10VadymTS1) [14:36:14] !log dancy@deploy1003 Installation of scap version "4.255.0" completed for 2 hosts [14:36:35] cscott: Go ahead with your deployment. [14:36:52] RECOVERY - Host kafka-logging2005 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [14:37:18] (03CR) 10Bking: [C:03+2] cumin: install gnutls-bin package [puppet] - 10https://gerrit.wikimedia.org/r/1279491 (https://phabricator.wikimedia.org/T424672) (owner: 10Bking) [14:37:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1378.eqiad.wmnet with reason: host reimage [14:38:28] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy hotfix revertrisk-multilingual on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280442 (owner: 10Gkyziridis) [14:40:34] (03Merged) 10jenkins-bot: ml-services: Deploy hotfix revertrisk-multilingual on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1280442 (owner: 10Gkyziridis) [14:40:44] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11875866 (10RobH) >>! In T408892#11873637, @Papaul wrote: > @RobH Remote hands instructions are ready @ https://docs.google.com/document/d/1EW6hxHCQjXPy1PXQWlu... [14:41:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T419961)', diff saved to https://phabricator.wikimedia.org/P92088 and previous config saved to /var/cache/conftool/dbconfig/20260430-144105-fceratto.json [14:41:18] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:41:25] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:41:50] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:42:25] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:42:26] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:45:48] !log installing pdns security updates [14:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:46] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [14:49:39] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v1.3.1 release to 
staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280441 (owner: 10Santiago Faci) [14:50:39] dancy: thanks! [14:50:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:51:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P92089 and previous config saved to /var/cache/conftool/dbconfig/20260430-145112-fceratto.json [14:51:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279453 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian) [14:52:30] (03Merged) 10jenkins-bot: Increase Parsoid Read Views to 60% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279453 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian) [14:53:00] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1279453|Increase Parsoid Read Views to 60% of enwiki mobile web traffic (T424880)]] [14:53:05] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [14:53:36] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [14:54:04] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:54:55] !log cscott@deploy1003 cscott: Backport for [[gerrit:1279453|Increase Parsoid Read Views to 60% of enwiki mobile web traffic (T424880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:55:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:55:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1378.eqiad.wmnet with OS trixie [14:55:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11875934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cu... [14:56:36] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.3.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280441 (owner: 10Santiago Faci) [14:57:50] (03CR) 10Dzahn: [C:03+2] admin: extend expiry_date for sarmbruster by 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402) (owner: 10Dzahn) [14:58:09] (03PS1) 10Eevans: linked-artifacts: deploy hoarde v1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280469 (https://phabricator.wikimedia.org/T424545) [14:58:35] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280441 (owner: 10Santiago Faci) [14:58:42] (03PS2) 10Dzahn: admin: extend expiry_date for sarmbruster by 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402) [14:59:05] herron: o/ if you are reimaging a kafka 3.7 node to trixie, keep https://wikitech.wikimedia.org/wiki/Kafka/Administration#Upgrade_to_Debian_Trixie in mind [14:59:34] elukey: thanks was just working on this part [14:59:39] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:59:53] !log dcausse@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [15:00:12] (03CR) 10Dzahn: "rebased to nothing - because done in 8897a46aae1185bd" [puppet] - 10https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402) (owner: 10Dzahn) [15:00:25] (03Abandoned) 10Dzahn: admin: extend expiry_date for sarmbruster by 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402) (owner: 10Dzahn) [15:01:12] (03CR) 10Eevans: [C:03+2] linked-artifacts: deploy hoarde v1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280469 (https://phabricator.wikimedia.org/T424545) (owner: 10Eevans) [15:01:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P92091 and previous config saved to /var/cache/conftool/dbconfig/20260430-150120-fceratto.json [15:01:30] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Extend wmde/nda LDAP access for Sarmbruster - https://phabricator.wikimedia.org/T424402#11875959 (10Dzahn) already done by Moritz with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1280055 [15:02:22] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Extend wmde/nda LDAP access for Sarmbruster - https://phabricator.wikimedia.org/T424402#11875974 (10Dzahn) 05In progress→03Resolved a:03MoritzMuehlenhoff [15:02:29] !log cscott@deploy1003 cscott: Continuing with deployment [15:03:12] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:03:20] (03Merged) 10jenkins-bot: linked-artifacts: deploy hoarde v1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280469 (https://phabricator.wikimedia.org/T424545) (owner: 10Eevans) [15:04:23] (03CR) 10Dzahn: [C:03+2] zuul: Upgrade to Zuul 14.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1279500 (https://phabricator.wikimedia.org/T424879) (owner: 10Dduvall) [15:04:43] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:05:21] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [15:05:36] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [15:06:16] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279453|Increase Parsoid Read Views to 60% of enwiki mobile web traffic (T424880)]] (duration: 13m 15s) [15:06:20] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [15:06:34] dancy: ok, i'm done now. thanks! [15:06:41] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:06:49] (03PS3) 10Herron: kafka-logging2005: use jdk 21 in trixie [puppet] - 10https://gerrit.wikimedia.org/r/1280467 (https://phabricator.wikimedia.org/T417001) [15:07:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bearloga@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: 10Bearloga) [15:08:33] (03Merged) 10jenkins-bot: EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: 10Bearloga) [15:08:56] !log bearloga@deploy1003 Started scap sync-world: Backport for [[gerrit:1270454|EventStreamConfig: remove ABST contextual attribute (T422001)]] [15:09:01] T422001: '.performer.active_browsing_session_token' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T422001 [15:10:51] !log bearloga@deploy1003 bearloga: Backport for [[gerrit:1270454|EventStreamConfig: remove ABST contextual attribute (T422001)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:11:01] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2005.codfw.wmnet with OS trixie [15:11:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T419961)', diff saved to https://phabricator.wikimedia.org/P92092 and previous config saved to /var/cache/conftool/dbconfig/20260430-151128-fceratto.json [15:11:42] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280479 [15:11:49] !log bearloga@deploy1003 bearloga: Continuing with deployment [15:11:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [15:11:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T419961)', diff saved to https://phabricator.wikimedia.org/P92093 and previous config saved to /var/cache/conftool/dbconfig/20260430-151157-fceratto.json [15:14:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:16:22] !log bearloga@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270454|EventStreamConfig: remove ABST contextual attribute (T422001)]] (duration: 07m 25s) [15:16:28] T422001: '.performer.active_browsing_session_token' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T422001 [15:20:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T419961)', diff saved to https://phabricator.wikimedia.org/P92094 and previous config saved to /var/cache/conftool/dbconfig/20260430-152011-fceratto.json [15:20:15] FIRING: MediaWikiHighErrorRate: Elevated rate of 
MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:20:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [15:22:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11876109 (10elukey) Deployed the spicerack changes, now I am testing https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1266257 to bypass the roo... [15:25:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:25:20] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11876119 (10A_smart_kitten) >>! In T414805#11875703, @daniel wrote: > APIs are maintained as stable interfaces, their evolution is su... 
[15:25:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [15:25:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11876121 (10Jclark-ctr) 05Open→03Resolved [15:28:24] (03PS3) 10C. Scott Ananian: Increase Parsoid Read Views to 100% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880) [15:28:24] (03PS1) 10C. Scott Ananian: Enable Parsoid postprocessing cache on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280491 (https://phabricator.wikimedia.org/T424880) [15:29:48] dancy: is the window still clear? turns out i need a follow up to the patch I just deployed: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1280491) [15:30:00] Yep. [15:30:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P92095 and previous config saved to /var/cache/conftool/dbconfig/20260430-153019-fceratto.json [15:30:29] ok, i'm going to spiderpig that patch out if that's ok. [15:30:39] OK with me. [15:31:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280491 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian) [15:33:05] (03Merged) 10jenkins-bot: Enable Parsoid postprocessing cache on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280491 (https://phabricator.wikimedia.org/T424880) (owner: 10C. 
Scott Ananian) [15:33:32] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1280491|Enable Parsoid postprocessing cache on enwiki (T424880)]] [15:33:39] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [15:35:25] !log cscott@deploy1003 cscott: Backport for [[gerrit:1280491|Enable Parsoid postprocessing cache on enwiki (T424880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:37:54] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11876203 (10RobH) >>! In T408892#11875866, @RobH wrote: >>>! In T408892#11873637, @Papaul wrote: >> @RobH Remote hands instructions are ready @ https://docs.go... [15:37:58] !log cscott@deploy1003 cscott: Continuing with deployment [15:40:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P92096 and previous config saved to /var/cache/conftool/dbconfig/20260430-154027-fceratto.json [15:41:45] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280491|Enable Parsoid postprocessing cache on enwiki (T424880)]] (duration: 08m 13s) [15:41:53] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [15:41:59] dancy: ok, done. for real this time i hope. [15:44:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11876239 (10Ladsgroup) Amazing. Thank you!!! \o/ [15:44:31] cscott: Thanks, and good luck! [15:44:41] !log dancy@deploy1003 Installing scap version "4.256.0" for 2 host(s) [15:45:30] the cache save rate is rising as page views transition, but so far nothing alarming. 
🤞 [15:46:32] !log dancy@deploy1003 Installation of scap version "4.256.0" completed for 2 hosts [15:50:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T419961)', diff saved to https://phabricator.wikimedia.org/P92098 and previous config saved to /var/cache/conftool/dbconfig/20260430-155034-fceratto.json [15:50:55] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [15:51:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92099 and previous config saved to /var/cache/conftool/dbconfig/20260430-155102-fceratto.json [16:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:03:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92100 and previous config saved to /var/cache/conftool/dbconfig/20260430-160307-fceratto.json [16:05:45] (03PS1) 10AKhatun: stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280508 (https://phabricator.wikimedia.org/T423920) [16:06:24] jouncebot: nowandnext [16:06:24] For the next 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1600) [16:06:24] In 0 hour(s) and 53 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1700) [16:06:24] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1700) [16:09:20] FIRING: 
[2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:11:27] zabe: puppet window isn't in use today, as you could probably tell [16:13:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P92103 and previous config saved to /var/cache/conftool/dbconfig/20260430-161315-fceratto.json [16:13:50] 06SRE, 10observability: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T422816#11876405 (10herron) Today I reimaged kafka-logging2005 with `--move-vlan` and afterwards the node is having trouble rejoining the cluster. I'm seeing errors like... [16:16:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:19] thx [16:18:07] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925#11876423 (10Jhancock.wm) [16:18:17] (03PS1) 10Gkyziridis: ml-services: Roll back to the previous model revertrisk-multilingual. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280514 [16:20:50] (03CR) 10Gkyziridis: [C:03+2] ml-services: Roll back to the previous model revertrisk-multilingual. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1280514 (owner: 10Gkyziridis) [16:22:28] (03CR) 10JavierMonton: [C:03+1] stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280508 (https://phabricator.wikimedia.org/T423920) (owner: 10AKhatun) [16:22:53] (03PS1) 10Atsuko: dse-k8s: deploy additional opensearch clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280515 (https://phabricator.wikimedia.org/T424248) [16:22:57] (03Merged) 10jenkins-bot: ml-services: Roll back to the previous model revertrisk-multilingual. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280514 (owner: 10Gkyziridis) [16:23:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P92104 and previous config saved to /var/cache/conftool/dbconfig/20260430-162323-fceratto.json [16:23:56] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11876468 (10Papaul) Yes I can take care of that. [16:24:20] (03PS2) 10AKhatun: stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280508 (https://phabricator.wikimedia.org/T423920) [16:25:50] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [16:26:02] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[16:26:35] (03CR) 10AKhatun: [C:03+2] stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280508 (https://phabricator.wikimedia.org/T423920) (owner: 10AKhatun) [16:28:35] (03Merged) 10jenkins-bot: stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280508 (https://phabricator.wikimedia.org/T423920) (owner: 10AKhatun) [16:30:01] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:30:04] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:31:07] (03CR) 10CDanis: fundraising_data_import maintenance script wrapper & timer (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:31:57] (03PS1) 10Medelius: Suggestion Mode controlled experiment: limit exposure to newcomers [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280516 (https://phabricator.wikimedia.org/T422736) [16:32:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280516 (https://phabricator.wikimedia.org/T422736) (owner: 10Medelius) [16:33:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T419961)', diff saved to https://phabricator.wikimedia.org/P92105 and previous config saved to /var/cache/conftool/dbconfig/20260430-163332-fceratto.json [16:33:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance 
[16:34:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T419961)', diff saved to https://phabricator.wikimedia.org/P92106 and previous config saved to /var/cache/conftool/dbconfig/20260430-163400-fceratto.json [16:34:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:32] (03PS2) 10Atsuko: dse-k8s: deploy additional opensearch clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280515 (https://phabricator.wikimedia.org/T424248) [16:42:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T419961)', diff saved to https://phabricator.wikimedia.org/P92107 and previous config saved to /var/cache/conftool/dbconfig/20260430-164211-fceratto.json [16:52:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P92108 and previous config saved to /var/cache/conftool/dbconfig/20260430-165221-fceratto.json [16:52:42] 10ops-eqiad, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1378:9290 - https://phabricator.wikimedia.org/T425015 (10phaultfinder) 03NEW [17:00:05] bd808: #bothumor I � Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1700) [17:02:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P92109 and previous config saved to /var/cache/conftool/dbconfig/20260430-170229-fceratto.json [17:03:04] (03PS1) 10Jasmine: sophroid: define nodePort to utilze custom load balancers, [0] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) [17:04:41] !log dancy@deploy1003 Installing scap version "4.257.0" for 2 host(s) [17:04:51] 10ops-eqiad, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1378:9290 - https://phabricator.wikimedia.org/T425015#11876581 (10Jclark-ctr) a:03Jclark-ctr [17:06:32] !log dancy@deploy1003 Installation of scap version "4.257.0" completed for 2 hosts [17:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [17:10:00] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:10:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T419961)', diff saved to https://phabricator.wikimedia.org/P92110 and previous config saved to /var/cache/conftool/dbconfig/20260430-171237-fceratto.json [17:12:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [17:13:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T419961)', diff saved to https://phabricator.wikimedia.org/P92111 and previous config saved to /var/cache/conftool/dbconfig/20260430-171306-fceratto.json [17:13:26] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:13:34] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:14:34] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [17:14:38] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [17:15:23] (03PS3) 10Dduvall: zuul: create profile for new zuul-launcher replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [17:15:32] (03PS2) 10Jasmine: sophroid: define nodePort to utilze custom load balancers, [0] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) [17:15:45] (03PS4) 10Dduvall: zuul: create profile for new zuul-launcher replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [17:17:00] (03CR) 10Dduvall: "Sorry, Daniel. I messed up in the task description. It's actually zuul-launcher, not zuul-builder. I renamed everything. 
I think this also" [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [17:19:57] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11876609 (10Nux) >>! In T414805#11875703, @daniel wrote: >>>! In T414805#11875160, @A_smart_kitten wrote: >> While it may be true tha... [17:21:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T419961)', diff saved to https://phabricator.wikimedia.org/P92112 and previous config saved to /var/cache/conftool/dbconfig/20260430-172119-fceratto.json [17:22:10] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:22:21] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:22:39] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:23:32] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:23:50] 06SRE, 06Infrastructure-Foundations, 10GitLab (CI & Job Runners), 13Patch-For-Review, 06Release-Engineering-Team (Priority Backlog 📥): Update default GitLab runner image to a base image without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423971#11876616 (10dancy) 05Open→03Resolved... 
[17:28:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:31:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P92113 and previous config saved to /var/cache/conftool/dbconfig/20260430-173127-fceratto.json [17:34:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:41:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P92114 and previous config saved to /var/cache/conftool/dbconfig/20260430-174135-fceratto.json [17:43:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [17:51:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T419961)', diff saved to https://phabricator.wikimedia.org/P92115 and previous config saved to /var/cache/conftool/dbconfig/20260430-175143-fceratto.json [17:52:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [17:52:11] !log fceratto@cumin1003 dbctl commit 
(dc=all): 'Depooling db1234 (T419961)', diff saved to https://phabricator.wikimedia.org/P92116 and previous config saved to /var/cache/conftool/dbconfig/20260430-175211-fceratto.json [17:57:10] (03PS5) 10Dduvall: zuul: create profile for new zuul-launcher replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [17:57:38] (03CR) 10Dduvall: "Added kubeconfig for zuul-launcher and a new connection section to zuul.conf." [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [17:57:47] (03CR) 10CDobbins: [C:03+2] wikimedia.org: Add TXT verification for Claude [dns] - 10https://gerrit.wikimedia.org/r/1279402 (https://phabricator.wikimedia.org/T424785) (owner: 10CDobbins) [18:00:05] jeena and dduvall: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T1800). [18:00:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T419961)', diff saved to https://phabricator.wikimedia.org/P92117 and previous config saved to /var/cache/conftool/dbconfig/20260430-180036-fceratto.json [18:04:14] !log cdobbins@dns1005 START - running authdns-update [18:05:53] !log cdobbins@dns1005 END - running authdns-update [18:07:33] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11876724 (10CDobbins) 05Open→03In progress p:05Triage→03Medium [18:08:31] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11876731 (10CDobbins) I just updated our DNS records, @bcampbell. Let me know if there's any unexpected behavior or if I can close the ti... 
[18:08:35] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280551 (https://phabricator.wikimedia.org/T423877) [18:08:38] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280551 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot) [18:08:50] (03PS1) 10Medelius: Abandon the editor survey: update edit count restriction [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280552 (https://phabricator.wikimedia.org/T422931) [18:09:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280552 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [18:09:33] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280551 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot) [18:10:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P92118 and previous config saved to /var/cache/conftool/dbconfig/20260430-181044-fceratto.json [18:15:13] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.26 refs T423877 [18:15:17] T423877: 1.46.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T423877 [18:20:25] (03CR) 10CI reject: [V:04-1] Abandon the editor survey: update edit count restriction [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280552 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [18:20:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P92119 and 
previous config saved to /var/cache/conftool/dbconfig/20260430-182052-fceratto.json [18:28:06] (03CR) 10Medelius: "recheck" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280552 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [18:28:06] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:28:11] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:29:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:30:13] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.153.221 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:31:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T419961)', diff saved to https://phabricator.wikimedia.org/P92120 and previous config saved to /var/cache/conftool/dbconfig/20260430-183100-fceratto.json [18:31:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [18:31:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T419961)', diff saved to https://phabricator.wikimedia.org/P92121 and previous config saved to /var/cache/conftool/dbconfig/20260430-183130-fceratto.json [18:35:13] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:44:43] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T419961)', diff saved to https://phabricator.wikimedia.org/P92122 and previous config saved to /var/cache/conftool/dbconfig/20260430-184439-fceratto.json [18:44:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:54:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P92123 and previous config saved to /var/cache/conftool/dbconfig/20260430-185451-fceratto.json [19:04:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P92124 and previous config saved to /var/cache/conftool/dbconfig/20260430-190459-fceratto.json [19:12:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11876957 (10elukey) @Jclark-ctr if you have time could you please check the status of the 1007's BMC? Like if you are able to access the WebUI somehow... 
[19:15:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T419961)', diff saved to https://phabricator.wikimedia.org/P92125 and previous config saved to /var/cache/conftool/dbconfig/20260430-191507-fceratto.json [19:15:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1251.eqiad.wmnet with reason: Maintenance [19:15:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T419961)', diff saved to https://phabricator.wikimedia.org/P92126 and previous config saved to /var/cache/conftool/dbconfig/20260430-191527-fceratto.json [19:23:13] (03PS3) 10Jasmine: sophroid: define nodePort to utilze custom load balancers [0] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) [19:24:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T419961)', diff saved to https://phabricator.wikimedia.org/P92127 and previous config saved to /var/cache/conftool/dbconfig/20260430-192407-fceratto.json [19:34:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P92128 and previous config saved to /var/cache/conftool/dbconfig/20260430-193415-fceratto.json [19:34:57] (03CR) 10Scott French: [C:03+1] "Thanks, Jasmine!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [19:41:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P92129 and previous config saved to /var/cache/conftool/dbconfig/20260430-194423-fceratto.json [19:54:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T419961)', diff saved to https://phabricator.wikimedia.org/P92130 and previous config saved to /var/cache/conftool/dbconfig/20260430-195431-fceratto.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T2000) [20:00:05] cscott, VadymTS1, and cmede: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] o/ [20:00:24] o/ [20:00:56] I'm going to spiderpig mine straight out of the gate here, since I'd like to be able to watch the cache stats during the duration of the backport window, in case I need to dial things back. [20:01:20] I'm here [20:01:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880) (owner: 10C. 
Scott Ananian) [20:02:32] (03Merged) 10jenkins-bot: Increase Parsoid Read Views to 100% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian) [20:02:49] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1279454|Increase Parsoid Read Views to 100% of enwiki mobile web traffic (T424880)]] [20:02:54] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [20:04:32] !log cscott@deploy1003 cscott: Backport for [[gerrit:1279454|Increase Parsoid Read Views to 100% of enwiki mobile web traffic (T424880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:06:02] !log cscott@deploy1003 cscott: Continuing with deployment [20:06:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [20:09:20] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:09:50] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279454|Increase Parsoid Read Views to 100% of enwiki mobile web traffic (T424880)]] (duration: 07m 01s) [20:09:55] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [20:10:05] (03CR) 10Jasmine: [C:03+2] 
sophroid: define nodePort to utilze custom load balancers [0] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:10:13] (03CR) 10RLazarus: [C:03+1] sophroid: define nodePort to utilze custom load balancers [0] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:10:14] ok, i'm done [20:10:14] 1 [20:10:28] hi, I’d like to add something to the window [20:10:28] VadymTS1: over to you [20:10:58] ok [20:11:02] lets start [20:11:11] (03PS1) 10Kosta Harlan: hCaptcha: Label load and execute duration metrics with outcome [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280656 (https://phabricator.wikimedia.org/T421204) [20:11:23] (03PS1) 10Kosta Harlan: hCaptcha: Reduce default MAX_LOAD_ATTEMPTS from 10 to 6 [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280657 (https://phabricator.wikimedia.org/T421204) [20:11:26] kostajh: schedule-deployment on gerrit works up until the window closes, i believe. :) [20:11:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [20:11:55] yep, will add it [20:12:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: who's the deployer for this window? 
[20:12:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280657 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [20:12:08] (03Merged) 10jenkins-bot: sophroid: define nodePort to utilze custom load balancers [0] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1280521 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [20:12:23] I can deploy [20:12:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280656 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [20:12:39] thanks jeena [20:12:41] i've finished my config patch, we're up to VadymTS1 in the corner [20:12:42] my patches can be synced together [20:12:46] *order, not corner :) [20:12:54] and they do not need to be verified either [20:13:16] Thanks jeena. Sorry I'm at the hackathon and heading to bed [20:13:27] You have a deployer now? [20:14:07] TheresNoTime: yeah all god [20:14:10] good* [20:14:22] cool :) [20:15:00] VadymTS1: is it fine to deploy all your changes together? [20:15:33] I'm think yes [20:15:41] I don't see problems [20:17:14] It's not like it's forbidden [20:18:31] yes of course, just making sure! 
[20:18:37] I will proceed now [20:20:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: 10Codename Noreste) [20:20:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898) (owner: 10VadymTS1) [20:20:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247186 (https://phabricator.wikimedia.org/T418815) (owner: 10MGChecker) [20:20:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236361 (https://phabricator.wikimedia.org/T416174) (owner: 10Seawolf35gerrit) [20:20:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280449 (https://phabricator.wikimedia.org/T424983) (owner: 10VadymTS1) [20:21:19] (03Merged) 10jenkins-bot: ukwiki: Remove the patroller user group and adjust various user rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: 10Codename Noreste) [20:21:22] (03Merged) 10jenkins-bot: nlwiki: Modify autoconfirmed requirements for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898) (owner: 10VadymTS1) [20:21:26] (03Merged) 10jenkins-bot: dewiki: Add abusefilter group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247186 (https://phabricator.wikimedia.org/T418815) (owner: 10MGChecker) [20:21:30] (03Merged) 10jenkins-bot: Add map domains for ruwiki to the list of externallinks-excluded domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236361 
(https://phabricator.wikimedia.org/T416174) (owner: 10Seawolf35gerrit) [20:21:33] (03Merged) 10jenkins-bot: [eswiktionary] Switch $wgSignatureValidation to 'disallow' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1280449 (https://phabricator.wikimedia.org/T424983) (owner: 10VadymTS1) [20:21:47] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1274928|ukwiki: Remove the patroller user group and adjust various user rights (T423461)]], [[gerrit:1279477|nlwiki: Modify autoconfirmed requirements for nlwiki (T424898)]], [[gerrit:1247186|dewiki: Add abusefilter group (T418815)]], [[gerrit:1236361|Add map domains for ruwiki to the list of externallinks-excluded domains (T416174)]], [[gerrit:128044 [20:21:47] 9|[eswiktionary] Switch $wgSignatureValidation to 'disallow' (T424983)]] [20:21:57] T423461: Turn off patrolling in ukwiki - https://phabricator.wikimedia.org/T423461 [20:21:57] T424898: Modify autoconfirmed requirements for nlwiki - https://phabricator.wikimedia.org/T424898 [20:21:57] T418815: Add abusefilter group to dewiki - https://phabricator.wikimedia.org/T418815 [20:21:58] T416174: Add map domains for ruwiki to the list of externallinks-excluded domains (wgExternalLinksIgnoreDomains) - https://phabricator.wikimedia.org/T416174 [20:21:58] T424983: Set $wgSignatureValidation to 'disallow' on Spanish Wiktionary - https://phabricator.wikimedia.org/T424983 [20:23:30] !log jhuneidi@deploy1003 vadymts1, seawolf35gerrit, jhuneidi, codenamenoreste, mgchecker: Backport for [[gerrit:1274928|ukwiki: Remove the patroller user group and adjust various user rights (T423461)]], [[gerrit:1279477|nlwiki: Modify autoconfirmed requirements for nlwiki (T424898)]], [[gerrit:1247186|dewiki: Add abusefilter group (T418815)]], [[gerrit:1236361|Add map domains for ruwiki to the list of externallinks-exclu [20:23:30] ded domains (T416174)]], [[gerrit:1280449|[eswiktionary] Switch $wgSignatureValidation to 'disallow' (T424983)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:23:40] cheking [20:27:54] wait a minute a have bad internet [20:28:47] no problem [20:29:49] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:30:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11877160 (10Jclark-ctr) hooked crashcart up to 1007 bmc is set to dhcp and is not picking up any address. [20:30:33] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:31:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:32:27] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:32:27] jeena All good [20:32:44] Thanks! 
[20:32:49] !log jhuneidi@deploy1003 vadymts1, seawolf35gerrit, jhuneidi, codenamenoreste, mgchecker: Continuing with deployment [20:35:40] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply [20:36:40] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1274928|ukwiki: Remove the patroller user group and adjust various user rights (T423461)]], [[gerrit:1279477|nlwiki: Modify autoconfirmed requirements for nlwiki (T424898)]], [[gerrit:1247186|dewiki: Add abusefilter group (T418815)]], [[gerrit:1236361|Add map domains for ruwiki to the list of externallinks-excluded domains (T416174)]], [[gerrit:12804 [20:36:41] 49|[eswiktionary] Switch $wgSignatureValidation to 'disallow' (T424983)]] (duration: 14m 53s) [20:37:00] T423461: Turn off patrolling in ukwiki - https://phabricator.wikimedia.org/T423461 [20:37:00] T424898: Modify autoconfirmed requirements for nlwiki - https://phabricator.wikimedia.org/T424898 [20:37:00] T418815: Add abusefilter group to dewiki - https://phabricator.wikimedia.org/T418815 [20:37:01] T416174: Add map domains for ruwiki to the list of externallinks-excluded domains (wgExternalLinksIgnoreDomains) - https://phabricator.wikimedia.org/T416174 [20:37:02] T424983: Set $wgSignatureValidation to 'disallow' on Spanish Wiktionary - https://phabricator.wikimedia.org/T424983 [20:37:43] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply [20:38:02] cmede: do you need a deployer? [20:38:06] yes please! [20:38:14] 👍 [20:39:55] It's fine to do both your changes in one deploy right? 
[20:40:01] yep :) [20:40:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:40:26] \i/ [20:40:30] jasmine_: ^ [20:40:31] \i/ [20:40:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280516 (https://phabricator.wikimedia.org/T422736) (owner: 10Medelius) [20:40:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280552 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [20:40:50] Chlorinated [20:41:17] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:42:03] (03Merged) 10jenkins-bot: Suggestion Mode controlled experiment: limit exposure to newcomers [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280516 (https://phabricator.wikimedia.org/T422736) (owner: 10Medelius) [20:42:05] dancy: 😆 [20:42:08] (03Merged) 10jenkins-bot: Abandon the editor survey: update edit count restriction [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280552 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [20:42:25] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1280516|Suggestion Mode controlled experiment: limit exposure to newcomers (T422736)]], [[gerrit:1280552|Abandon the editor survey: update edit count restriction (T422931)]] [20:42:32] T422736: Define and implement any missing metrics needed for Suggestion Mode controlled experiment - https://phabricator.wikimedia.org/T422736 [20:42:32] T422931: Implement the "Exit the editor" survey - https://phabricator.wikimedia.org/T422931 [20:42:33] nicee, thanks swfrench-wmf! 
[20:44:05] !log jhuneidi@deploy1003 caro, jhuneidi: Backport for [[gerrit:1280516|Suggestion Mode controlled experiment: limit exposure to newcomers (T422736)]], [[gerrit:1280552|Abandon the editor survey: update edit count restriction (T422931)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:44:14] checking~~ [20:46:12] all good [20:46:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11877235 (10Jclark-ctr) As soon as i started provision script it started to ping I aborted its back to you. [20:46:31] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/sophroid: apply [20:46:55] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/sophroid: apply [20:47:13] !log jhuneidi@deploy1003 caro, jhuneidi: Continuing with deployment [20:47:16] thanks cmede [20:47:47] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1378:9290 - https://phabricator.wikimedia.org/T425015#11877257 (10Jclark-ctr) 05Open→03Resolved [20:49:11] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:49:35] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:49:58] jeene Sorry to bother you again, but it seems the change 1280449 hasn't been applied, can you see her. That's my promise [20:50:57] VadymTS1: it says it was merged, so it should be deployed. Is it not working? 
[20:50:59] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280516|Suggestion Mode controlled experiment: limit exposure to newcomers (T422736)]], [[gerrit:1280552|Abandon the editor survey: update edit count restriction (T422931)]] (duration: 08m 33s) [20:51:04] T422736: Define and implement any missing metrics needed for Suggestion Mode controlled experiment - https://phabricator.wikimedia.org/T422736 [20:51:05] T422931: Implement the "Exit the editor" survey - https://phabricator.wikimedia.org/T422931 [20:51:21] thank you jeena! [20:51:30] yw! [20:51:50] No I see another problem the Phabricator don't see the SAL logs [20:52:29] jeena: will you sync the two patches I have up, or would you like for me to do it? [20:52:45] I can do it if you prefer! [20:52:48] VadymTS1: that’s probably just because they were all synced together [20:52:49] (03CR) 10Dzahn: [C:03+2] zuul: remove zuul-nodepool config, user, stop service [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [20:53:02] jeena: happy for you to do it as it’s late here [20:53:09] 👍 [20:53:13] thanks kostajh [20:53:26] Yes I see the cod is working [20:54:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280657 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [20:54:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280656 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [20:57:55] VadymTS1: I think what happened is probably there is a character limit for the SAL log and since we synced so many changes the final one got cut off [20:59:10] yep this is https://phabricator.wikimedia.org/T285709 [20:59:20] RESOLVED: [2x] JobUnavailable: Reduced 
availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:27] it's not SAL per se, it's an IRC message length limit -- the message starting with "!log" gets split in two [20:59:32] oh thanks rzl! [20:59:41] I see [21:00:03] but because we use IRC to carry messages from the deployment server to the SAL, that's the limiting factor in between [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260430T2100) [21:00:13] just noticed the second line after you mentioned that [21:00:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [21:05:49] (03Merged) 10jenkins-bot: hCaptcha: Reduce default MAX_LOAD_ATTEMPTS from 10 to 6 [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280657 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [21:05:51] (03Merged) 10jenkins-bot: hCaptcha: Label load and execute duration metrics with outcome [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280656 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [21:06:08] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1280657|hCaptcha: Reduce default MAX_LOAD_ATTEMPTS from 10 to 6 (T421204)]], [[gerrit:1280656|hCaptcha: Label load and execute duration metrics with outcome (T421204)]] [21:07:09] !log zuul1001/zuul2001 - rmdir /etc/nodepool 
[21:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:48] !log jhuneidi@deploy1003 kharlan, jhuneidi: Backport for [[gerrit:1280657|hCaptcha: Reduce default MAX_LOAD_ATTEMPTS from 10 to 6 (T421204)]], [[gerrit:1280656|hCaptcha: Label load and execute duration metrics with outcome (T421204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:08:05] !log jhuneidi@deploy1003 kharlan, jhuneidi: Continuing with deployment [21:08:39] rzl: i was considering filing a task about that at some point some time ago, glad to see that there is already one :D [21:10:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:55] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280657|hCaptcha: Reduce default MAX_LOAD_ATTEMPTS from 10 to 6 (T421204)]], [[gerrit:1280656|hCaptcha: Label load and execute duration metrics with outcome (T421204)]] (duration: 05m 47s) [21:12:06] jeena: thanks! [21:12:16] yw! [21:12:22] A_smart_kitten: yeah! 
I can't exactly say with a straight face that we're prioritizing it 🙃 but it's known at least [21:12:48] (03PS6) 10Dzahn: zuul: create profile for new zuul-launcher replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) [21:21:54] (03CR) 10Dzahn: [V:04-1] "Function lookup() did not find a value for the name 'profile::zuul::launcher::user_token'" [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [21:23:46] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:28:14] !log cdobbins@cumin2002 conftool action : get/pooled; selector: name=cp4041.ulsfo.wmnet [21:28:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:28:42] !log cdobbins@cumin2002 conftool action : get/pooled; selector: name=cp* [21:31:41] !log cdobbins@cumin2002 conftool action : get/pooled; selector: name=cp4044.ulsfo.wmnet [21:31:48] !log cdobbins@cumin2002 conftool action : get/pooled; selector: name=cp4040.ulsfo.wmnet [21:42:29] (03PS1) 10Dzahn: zuul: rename nodepool::user_token to launcher::user_token [labs/private] - 10https://gerrit.wikimedia.org/r/1280729 (https://phabricator.wikimedia.org/T424879) [21:43:03] (03CR) 10Dzahn: [V:03+2 C:03+2] zuul: rename nodepool::user_token to launcher::user_token [labs/private] - 10https://gerrit.wikimedia.org/r/1280729 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [21:45:34] (03PS1) 10Cwhite: update pyyaml in dev [software/ecs] - 10https://gerrit.wikimedia.org/r/1280733 [21:48:48] (03CR) 10Cwhite: [C:04-1] logstash: add thanos-query-frontend filter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1275800 
(https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli) [21:48:54] (CR) Bking: [C:+1] "Feel free to deploy one or two clusters once the DNS piece is ready, no need to deploy every one quite yet." [deployment-charts] - https://gerrit.wikimedia.org/r/1280515 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko) [21:50:29] (CR) Dzahn: [V:-1] "renamed the nodepool::user_token to launcher::user_token in private and fake private" [puppet] - https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: Dzahn) [21:52:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [21:52:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:53:29] (PS1) Cwhite: add query object [software/ecs] - https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) [21:55:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [21:58:10] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-Needs-Improvement: switchdc SAL log entries are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709#11877426 (A_smart_kitten) Just noting for the task record that this also affects (e.g.) s... 
[21:58:29] (CR) Cwhite: [C:-1] logstash: add thanos-query-frontend filter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli) [22:02:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [22:02:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [22:23:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [22:24:26] SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11877498 (bcampbell) @CDobbins All looks good on the Anthropic end, I'm seeing the domain as verified now. Thanks for your help, feel f... [22:40:41] (CR) Scott French: [C:+1] "Thanks, Chris!" 
[puppet] - https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: CDanis) [22:53:20] (PS1) Cwhite: opensearch: move pki::get_cert call into profile module [puppet] - https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) [22:53:56] (CR) CI reject: [V:-1] opensearch: move pki::get_cert call into profile module [puppet] - https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) (owner: Cwhite) [23:03:32] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:08:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [23:28:32] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:30:14] (CR) Dzahn: [C:+2] zuul: create profile for new zuul-launcher replacing nodepool [puppet] - https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: Dzahn) [23:40:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1280813 [23:40:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1280813 (owner: TrainBranchBot) [23:40:55] (PS1) Dzahn: zuul: remove nodepool profile from zuul::main role [puppet] - https://gerrit.wikimedia.org/r/1280815 
(https://phabricator.wikimedia.org/T424879) [23:41:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:43:25] (CR) Dzahn: [C:+2] zuul: remove nodepool profile from zuul::main role [puppet] - https://gerrit.wikimedia.org/r/1280815 (https://phabricator.wikimedia.org/T424879) (owner: Dzahn) [23:50:32] (PS1) Dzahn: zuul: add placeholder template for launcher config [puppet] - https://gerrit.wikimedia.org/r/1280820 (https://phabricator.wikimedia.org/T424879) [23:51:26] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul2001.codfw.wmnet with reason: T421398 [23:51:31] T421398: SystemdUnitFailed - zuul-executor - https://phabricator.wikimedia.org/T421398 [23:51:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1280813 (owner: TrainBranchBot) [23:51:58] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398 [23:54:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:55:13] (CR) Dzahn: [C:+2] zuul: add placeholder template for launcher config [puppet] - https://gerrit.wikimedia.org/r/1280820 (https://phabricator.wikimedia.org/T424879) (owner: Dzahn) [23:56:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate