[00:05:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:24:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:51:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:09:58] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1291031
[01:09:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1291031 (owner: TrainBranchBot)
[01:21:59] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1291031 (owner: TrainBranchBot)
[01:24:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:29:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:33:51] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-c1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:47:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:08:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo -
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:11:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:33:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:36:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:41:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:41:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:46:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:54:53] (PS1) Marostegui: pc1012: Decommission [puppet] -
https://gerrit.wikimedia.org/r/1291233 (https://phabricator.wikimedia.org/T426930)
[04:56:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc1012.eqiad.wmnet
[04:56:39] (CR) Marostegui: [C:+2] pc1012: Decommission [puppet] - https://gerrit.wikimedia.org/r/1291233 (https://phabricator.wikimedia.org/T426930) (owner: Marostegui)
[05:01:17] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11947213 (Marostegui) Open→Resolved All good ` root@dbproxy2005:~# cat /proc/mdstat Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sdb2[2] sda2[0]...
[05:03:08] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[05:05:52] (CR) Marostegui: "@fceratto@wikimedia.org can you take a look?" [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: FNegri)
[05:06:27] (CR) Marostegui: sre.mysql.upgrade: support multiinstance hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: FNegri)
[05:06:55] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:10:00] marostegui@cumin1003 decommission (PID 3133403) is awaiting input
[05:11:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:11:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:11:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1012.eqiad.wmnet
[05:13:22] ops-eqiad,
DBA, DC-Ops, decommission-hardware: decommission pc1012.eqiad.wmnet - https://phabricator.wikimedia.org/T426930#11947217 (Marostegui) a:Marostegui→None
[05:13:32] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission pc1012.eqiad.wmnet - https://phabricator.wikimedia.org/T426930#11947222 (Marostegui) Ready for DCOps
[05:17:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy2005.codfw.wmnet with reason: Reboot
[05:19:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1028.eqiad.wmnet with reason: Reboot
[05:22:44] (PS1) Marostegui: wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/1291552 (https://phabricator.wikimedia.org/T426633)
[05:23:12] !log Failover m5-master T426633
[05:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:03] (CR) Marostegui: [C:+2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/1291552 (https://phabricator.wikimedia.org/T426633) (owner: Marostegui)
[05:24:07] !log marostegui@dns1004 START - running authdns-update
[05:25:44] !log marostegui@dns1004 END - running authdns-update
[05:56:19] (CR) Marostegui: "Ah cool - you can always mark the commit as WIP so people know it is not ready for review." [cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260522T0600)
[06:01:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3007.wikimedia.org
[06:04:31] (CR) Samwilson: [C:+1] "Looks correct to me, although I've not tested it."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) (owner: Kosta Harlan)
[06:07:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3007.wikimedia.org
[06:08:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[06:08:35] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11947290 (ops-monitoring-bot) Draining ganeti1024.eqiad.wmnet of running VMs
[06:11:20] (PS1) Muehlenhoff: Make ganeti1057/1058 Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1291630 (https://phabricator.wikimedia.org/T424680)
[06:13:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet
[06:16:00] (PS1) Muehlenhoff: Record LDAP access for chudson [puppet] - https://gerrit.wikimedia.org/r/1291638
[06:19:57] (CR) Muehlenhoff: [C:+2] Record LDAP access for chudson [puppet] - https://gerrit.wikimedia.org/r/1291638 (owner: Muehlenhoff)
[06:23:39] (PS1) Muehlenhoff: Record LDAP access for zsinger [puppet] - https://gerrit.wikimedia.org/r/1291647
[06:28:12] (CR) Muehlenhoff: [C:+2] Record LDAP access for zsinger [puppet] - https://gerrit.wikimedia.org/r/1291647 (owner: Muehlenhoff)
[06:29:45] (CR) Muehlenhoff: [C:+2] Make ganeti1057/1058 Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1291630 (https://phabricator.wikimedia.org/T424680) (owner: Muehlenhoff)
[06:46:32] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3067.esams.wmnet} and A:cp
[06:56:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1057.eqiad.wmnet
[06:58:24] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3067.esams.wmnet
[06:58:24] !log slyngshede@cumin1003 END (PASS)
- Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3067.esams.wmnet} and A:cp
[06:58:40] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3075.esams.wmnet} and A:cp
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260522T0700)
[07:01:02] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1057
[07:02:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1057
[07:04:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1057.eqiad.wmnet
[07:06:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1057.eqiad.wmnet to cluster eqiad and group A
[07:10:25] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3075.esams.wmnet
[07:10:25] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3075.esams.wmnet} and A:cp
[07:11:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1057.eqiad.wmnet to cluster eqiad and group A
[07:12:17] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission pc1012.eqiad.wmnet - https://phabricator.wikimedia.org/T426930#11947349 (VRiley-WMF) a:VRiley-WMF
[07:14:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:14:14] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp306[8-9].esams.wmnet} and A:cp
[07:17:37] (CR) Federico Ceratto: sre.mysql.pool: Add support for downtime (1 comment) [cookbooks] -
https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams)
[07:19:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:25:28] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:26:04] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3068.esams.wmnet
[07:29:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:30:59] (PS1) Muehlenhoff: Update role contacts in Hiera [puppet] - https://gerrit.wikimedia.org/r/1291711
[07:31:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[07:32:57] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11947369 (ops-monitoring-bot) Draining ganeti1024.eqiad.wmnet of running VMs
[07:33:51] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-c1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:34:20] (CR) CWilliams: sre.mysql.pool: Add support for downtime (1 comment)
[cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams)
[07:38:56] (CR) Federico Ceratto: sre.mysql.upgrade: support multiinstance hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: FNegri)
[07:43:38] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11947396 (Martyn.ranyard) I can confirm for Annie here that she is actually her, I h...
[07:45:45] (CR) Federico Ceratto: sre.mysql.pool: Add support for downtime (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams)
[07:47:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) (owner: Arthur taylor)
[07:53:41] (PS1) Brouberol: Remove un-needed login/password keys from wikidata_platform_s3_dpe [deployment-charts] - https://gerrit.wikimedia.org/r/1291830
[07:55:42] (PS2) Brouberol: Remove un-needed login/password keys from wikidata_platform_s3_dpe [deployment-charts] - https://gerrit.wikimedia.org/r/1291830
[07:58:36] (CR) Brouberol: [C:+2] Remove un-needed login/password keys from wikidata_platform_s3_dpe [deployment-charts] - https://gerrit.wikimedia.org/r/1291830
(owner: Brouberol)
[07:59:22] SRE, Ganeti, Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11947403 (MoritzMuehlenhoff)
[08:00:25] (CR) Marostegui: sre.mysql.pool: Add support for downtime (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams)
[08:05:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[08:05:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[08:07:48] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3069.esams.wmnet
[08:07:48] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp306[8-9].esams.wmnet} and A:cp
[08:09:56] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[08:11:34] (PS1) Brouberol: Add a way to verify the SSL CA for the S3 endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/1291859
[08:11:38] PROBLEM - Host ganeti1058 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:06] ^ expected
[08:15:30] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change records for ganeti1058 - cmooney@cumin1003"
[08:15:33] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ganeti1058.eqiad.wmnet on all recursors
[08:15:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: change records for ganeti1058 - cmooney@cumin1003"
[08:15:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:15:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti1058.eqiad.wmnet on all recursors
[08:16:02] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11947420 (SLyngshede-WMF) @Martyn.ranyard is only for the SSH key :-) We just need...
[08:18:28] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp307[6-7].esams.wmnet} and A:cp
[08:18:53] (CR) Brouberol: [C:+2] Add a way to verify the SSL CA for the S3 endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/1291859 (owner: Brouberol)
[08:19:34] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11947422 (SLyngshede-WMF)
[08:20:56] (Merged) jenkins-bot: Add a way to verify the SSL CA for the S3 endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/1291859 (owner: Brouberol)
[08:21:06] RECOVERY - Host ganeti1058 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[08:21:42] (PS3) Slyngshede: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: Dzahn)
[08:25:29] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11947423 (SLyngshede-WMF) Key has been verified. I'll ping you once I have everythin...
[08:30:04] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3076.esams.wmnet
[08:33:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[08:33:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[08:39:06] (PS1) Elukey: sre.hosts.reimage: test force_http_boot_once override [cookbooks] - https://gerrit.wikimedia.org/r/1291876
[08:40:02] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[08:46:39] elukey@cumin1003 reimage (PID 3156845) is awaiting input
[08:46:51] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie
[08:47:17] (PS2) Elukey: sre.hosts.reimage: test force_http_boot_once override [cookbooks] - https://gerrit.wikimedia.org/r/1291876
[08:47:37] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[08:50:11] (PS10) Gkyziridis: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680)
[08:55:06] (CR) Ilias Sarantopoulos: [C:+1] rest-gateway: Configure qwen3-14b in rest-gateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: Gkyziridis)
[09:03:20] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie
[09:03:58] (PS3) Elukey: sre.hosts.reimage: test force_http_boot_once override [cookbooks] - https://gerrit.wikimedia.org/r/1291876
[09:04:21] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[09:06:53] (CR) CI reject: [V:-1] sre.hosts.reimage: test force_http_boot_once override [cookbooks] -
https://gerrit.wikimedia.org/r/1291876 (owner: Elukey)
[09:10:30] (PS1) Hnowlan: cdn: exempt performance from paging [alerts] - https://gerrit.wikimedia.org/r/1291896 (https://phabricator.wikimedia.org/T425299)
[09:10:35] (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.3.6 to staging with poller enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1291897
[09:11:49] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3077.esams.wmnet
[09:11:49] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp307[6-7].esams.wmnet} and A:cp
[09:14:26] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp307[0-1].esams.wmnet} and A:cp
[09:16:48] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie
[09:17:16] (CR) Phuedx: [C:+1] Test Kitchen UI: Deploy v1.3.6 to staging with poller enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1291897 (owner: Santiago Faci)
[09:17:57] (CR) Santiago Faci: [C:+2] Test Kitchen UI: Deploy v1.3.6 to staging with poller enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1291897 (owner: Santiago Faci)
[09:18:46] (PS4) Elukey: sre.hosts.reimage: test force_http_boot_once override [cookbooks] - https://gerrit.wikimedia.org/r/1291876
[09:20:15] (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.3.6 to staging with poller enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1291897 (owner: Santiago Faci)
[09:20:50] (PS5) Elukey: sre.hosts.reimage: test force_http_boot_once override [cookbooks] - https://gerrit.wikimedia.org/r/1291876
[09:21:03] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[09:22:57] (CR) JMeybohm: [C:+1] docker_registry: move the /ml prefix to its new S3 backend [puppet] -
https://gerrit.wikimedia.org/r/1290808 (https://phabricator.wikimedia.org/T420978) (owner: Elukey)
[09:24:58] (CR) JMeybohm: [C:+1] k8s: add wikikube-worker2331 [puppet] - https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: Jasmine)
[09:26:04] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3070.esams.wmnet
[09:26:12] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply
[09:26:22] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply
[09:28:24] (PS3) Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865)
[09:30:35] (PS1) Jcrespo: mediabackup: Update s3cmd client configuration to eqiad/codfw sites [puppet] - https://gerrit.wikimedia.org/r/1291904 (https://phabricator.wikimedia.org/T420506)
[09:30:46] (CR) Klausman: [C:+1] docker_registry: move the /ml prefix to its new S3 backend [puppet] - https://gerrit.wikimedia.org/r/1290808 (https://phabricator.wikimedia.org/T420978) (owner: Elukey)
[09:31:19] (PS2) Jcrespo: mediabackup: Update s3cmd client configuration to eqiad/codfw sites [puppet] - https://gerrit.wikimedia.org/r/1291904 (https://phabricator.wikimedia.org/T420506)
[09:31:22] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1291904 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[09:33:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie
[09:34:06] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS trixie
[09:34:33] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[09:34:36] (CR) Klausman: [C:+1] rest-gateway: Configure qwen3-14b in
rest-gateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: Gkyziridis)
[09:34:54] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1172: Upgrading db1172.eqiad.wmnet
[09:35:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1172: Upgrading db1172.eqiad.wmnet
[09:37:29] (PS1) Muehlenhoff: Remove ganeti1024 from the eqiad cluster [puppet] - https://gerrit.wikimedia.org/r/1291905 (https://phabricator.wikimedia.org/T424680)
[09:38:09] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1172.eqiad.wmnet with OS trixie
[09:38:21] (CR) Jcrespo: [C:+2] mediabackup: Update s3cmd client configuration to eqiad/codfw sites [puppet] - https://gerrit.wikimedia.org/r/1291904 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[09:38:42] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[09:39:05] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2154: Upgrading db2154.codfw.wmnet
[09:39:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2154: Upgrading db2154.codfw.wmnet
[09:42:35] cwilliams@cumin1003 major-upgrade (PID 3164266) is awaiting input
[09:50:08] (PS7) Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831)
[09:51:34] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage
[09:53:33] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage
[09:55:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage
[09:56:21] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host
db2154.codfw.wmnet with OS trixie
[09:59:21] (CR) FNegri: sre.mysql.upgrade: support multiinstance hosts (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: FNegri)
[09:59:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage
[10:00:05] SRE, Infrastructure Security: Rollout ptrace hardening to roles which allow it - https://phabricator.wikimedia.org/T427039 (MoritzMuehlenhoff) NEW
[10:00:55] (CR) Marostegui: sre.mysql.upgrade: support multiinstance hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: FNegri)
[10:02:30] (CR) Kosta Harlan: hCaptcha CommonSettings.php: Don't define sitekeys as config vars (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1290964 (owner: Dreamy Jazz)
[10:03:47] (CR) Kosta Harlan: hCaptcha CommonSettings.php: Don't define sitekeys as config vars (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1290964 (owner: Dreamy Jazz)
[10:06:07] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3071.esams.wmnet
[10:06:08] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp307[0-1].esams.wmnet} and A:cp
[10:07:52] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp307[8-9].esams.wmnet} and A:cp
[10:08:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:11:37] (CR) Muehlenhoff: [C:+1] "Thanks" [alerts] - https://gerrit.wikimedia.org/r/1291896
(https://phabricator.wikimedia.org/T425299) (owner: Hnowlan)
[10:13:09] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2154.codfw.wmnet with reason: host reimage
[10:13:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:14:18] (PS1) Btullis: Add the ability to create secrets containing S3 tokens [deployment-charts] - https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764)
[10:15:08] (PS11) Gkyziridis: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680)
[10:15:30] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: Dzahn)
[10:15:37] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for clouddb1017.eqiad.wmnet
[10:16:02] (PS12) Clément Goubert: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: Gkyziridis)
[10:16:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1172.eqiad.wmnet with OS trixie
[10:17:52] (CR) Slyngshede: [C:+2] admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: Dzahn)
[10:18:23] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: host reimage
[10:19:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:37] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3078.esams.wmnet [10:21:59] (03PS2) 10Effie Mouzeli: site.pp: retire mc1037-mc1054 [puppet] - 10https://gerrit.wikimedia.org/r/1289287 (https://phabricator.wikimedia.org/T426303) [10:22:49] (03CR) 10JMeybohm: [C:03+1] site.pp: retire mc1037-mc1054 [puppet] - 10https://gerrit.wikimedia.org/r/1289287 (https://phabricator.wikimedia.org/T426303) (owner: 10Effie Mouzeli) [10:24:12] (03CR) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [10:24:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:17] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1172: Migration of db1172.eqiad.wmnet completed [10:24:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11947757 (10SLyngshede-WMF) @AnnieKim_WMDE you should be getting an email with a t... 
[10:24:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11947759 (10SLyngshede-WMF) 05Stalled→03Resolved [10:26:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [10:29:20] (03PS4) 10Btullis: [airflow-sre] Add a new cephfs PVC for data transfer purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288881 (https://phabricator.wikimedia.org/T380626) [10:31:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2010.codfw.wmnet with OS trixie [10:31:22] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploying v1.3.6 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291943 [10:34:19] (03PS4) 10Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 [10:34:42] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11947795 (10MLechvien-WMF) 05Stalled→03Open This is unstalled now and ready to be picked up [10:35:00] (03PS5) 10Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 [10:35:10] (03PS1) 10Clément Goubert: rest-gateway: Fix cache-control header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291944 (https://phabricator.wikimedia.org/T426323) [10:35:19] (03CR) 10Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 (owner: 10Dreamy Jazz) [10:35:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2154.codfw.wmnet with OS trixie [10:35:46] 
(03PS4) 10Btullis: Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [10:37:31] !log remove ganeti1024 from eqiad Ganeti cluster T424680 [10:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:35] T424680: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680 [10:37:51] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Fix cache-control header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291944 (https://phabricator.wikimedia.org/T426323) (owner: 10Clément Goubert) [10:38:33] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploying v1.3.6 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291943 (owner: 10Santiago Faci) [10:38:33] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti1024 from the eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/1291905 (https://phabricator.wikimedia.org/T424680) (owner: 10Muehlenhoff) [10:38:51] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploying v1.3.6 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291943 (owner: 10Santiago Faci) [10:38:53] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix cache-control header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291944 (https://phabricator.wikimedia.org/T426323) (owner: 10Clément Goubert) [10:40:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1012.eqiad.wmnet - https://phabricator.wikimedia.org/T426930#11947804 (10VRiley-WMF) 05Open→03Resolved [10:40:14] (03PS5) 10Btullis: Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [10:40:40] PROBLEM - ganeti-noded running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti 
[10:40:40] PROBLEM - ganeti-confd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:40:50] FIRING: ProbeDown: Service ganeti1024:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:54] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying v1.3.6 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291943 (owner: 10Santiago Faci) [10:41:01] (03PS1) 10Filippo Giunchedi: prometheus: align bastion_hosts puppet type [puppet] - 10https://gerrit.wikimedia.org/r/1291946 (https://phabricator.wikimedia.org/T424814) [10:41:03] (03PS1) 10Filippo Giunchedi: alerts: add transformations option [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) [10:41:04] (03Merged) 10jenkins-bot: rest-gateway: Fix cache-control header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291944 (https://phabricator.wikimedia.org/T426323) (owner: 10Clément Goubert) [10:41:05] (03PS1) 10Filippo Giunchedi: toolforge: use alerts::deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1291948 (https://phabricator.wikimedia.org/T424814) [10:41:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8574/co" [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [10:41:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [10:42:10] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:42:45] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2154: Migration of 
db2154.codfw.wmnet completed [10:42:49] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:42:58] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:43:14] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:43:18] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:43:35] (03PS2) 10Kosta Harlan: hCaptcha: Exempt CommunityRequests pages from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) [10:43:36] (03PS6) 10Btullis: Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [10:43:36] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:43:52] (03CR) 10Kosta Harlan: hCaptcha: Exempt CommunityRequests pages from edit/create triggers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) (owner: 10Kosta Harlan) [10:44:08] (03CR) 10CI reject: [V:04-1] alerts: add transformations option [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [10:44:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11947823 (10MoritzMuehlenhoff) [10:44:54] (03CR) 10Tiziano Fogli: [C:03+1] cdn: exempt performance from paging [alerts] - 10https://gerrit.wikimedia.org/r/1291896 (https://phabricator.wikimedia.org/T425299) (owner: 10Hnowlan) [10:45:44] (03PS2) 10Filippo Giunchedi: alerts: add transformations option [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) [10:45:44] (03PS2) 10Filippo Giunchedi: toolforge: use 
alerts::deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1291948 (https://phabricator.wikimedia.org/T424814) [10:47:33] (03CR) 10Majavah: "does this work? I'd expect the sudo `--preserve-env` flag to have to come before the `cookbook` command is given to it" [puppet] - 10https://gerrit.wikimedia.org/r/1290858 (owner: 10Andrew Bogott) [10:47:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [10:47:50] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [10:48:12] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [10:49:29] (03PS7) 10Btullis: Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) [10:50:50] RESOLVED: ProbeDown: Service ganeti1024:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:56] (03PS24) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [10:51:41] (03CR) 10Btullis: [C:03+2] [airflow-sre] Add a new cephfs PVC for data transfer purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288881 (https://phabricator.wikimedia.org/T380626) (owner: 10Btullis) [10:53:12] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [10:53:42] (03Merged) 10jenkins-bot: [airflow-sre] Add a new cephfs PVC for data transfer purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288881 (https://phabricator.wikimedia.org/T380626) (owner: 10Btullis) 
[10:55:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1058.eqiad.wmnet to cluster eqiad and group C [10:55:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1058.eqiad.wmnet to cluster eqiad and group C [10:56:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1058.eqiad.wmnet [10:57:11] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqord POP - https://phabricator.wikimedia.org/T427050 (10cmooney) 03NEW p:05Triage→03Medium [10:57:38] (03PS25) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [10:57:40] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqord POP - https://phabricator.wikimedia.org/T427050#11947853 (10cmooney) [10:58:37] (03PS1) 10Arthur taylor: Enable and configure WikiProjects prototype on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291951 (https://phabricator.wikimedia.org/T424329) [10:58:40] (03PS2) 10Hnowlan: cdn: exempt performance from paging [alerts] - 10https://gerrit.wikimedia.org/r/1291896 (https://phabricator.wikimedia.org/T425299) [10:58:57] (03Abandoned) 10Clément Goubert: rest-gateway: Let recommendation-api-ng set CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290865 (https://phabricator.wikimedia.org/T426323) (owner: 10Clément Goubert) [10:58:59] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqord POP - https://phabricator.wikimedia.org/T427050#11947860 (10cmooney) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260522T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260522T1100). [11:00:18] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:01:07] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3079.esams.wmnet [11:01:07] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp307[8-9].esams.wmnet} and A:cp [11:03:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1058.eqiad.wmnet [11:06:09] (03CR) 10Audrey Penven: [C:03+1] Enable and configure WikiProjects prototype on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291951 (https://phabricator.wikimedia.org/T424329) (owner: 10Arthur taylor) [11:06:59] (03PS1) 10Tiziano Fogli: performance.w.o: add http blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1291950 (https://phabricator.wikimedia.org/T425299) [11:07:11] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp307[2-3].esams.wmnet} and A:cp [11:07:47] 06SRE, 06Traffic: wiki.openstreetmap.org Commons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11947887 (10jcrespo) 05Open→03Resolved I am not seeing any 429 from this source in the last 15 days, so tentatively resolving. Please reopen if you disagree. 
[11:08:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291951 (https://phabricator.wikimedia.org/T424329) (owner: 10Arthur taylor) [11:09:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1172: Migration of db1172.eqiad.wmnet completed [11:09:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:11:08] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1017.eqiad.wmnet with reason: Rebooting clouddb1017 [11:11:29] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:59] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:15:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1058.eqiad.wmnet to cluster eqiad and group C [11:19:15] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3072.esams.wmnet [11:19:35] jmm@cumin2002 addnode (PID 1303958) is awaiting input [11:28:15] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2154: Migration of db2154.codfw.wmnet completed [11:28:16] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:28:18] 10ops-ulsfo, 06SRE, 06DC-Ops: magru: decom fibre link from cr3-ulsfo to cr4-ulsfo - https://phabricator.wikimedia.org/T427054 (10cmooney) 03NEW p:05Triage→03Medium [11:30:18] (03PS26) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:30:38] (03CR) 10CI reject: [V:04-1] P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:30:43] 10ops-ulsfo, 06SRE, 06DC-Ops: ulsfo: decom fibre 
link from cr3-ulsfo to cr4-ulsfo - https://phabricator.wikimedia.org/T427054#11947955 (10cmooney) [11:33:07] 10ops-ulsfo, 06SRE, 06DC-Ops: ulsfo: decom fibre links from cr3-ulsfo to cr4-ulsfo - https://phabricator.wikimedia.org/T427054#11947961 (10cmooney) [11:33:52] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-c1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:42:31] (03CR) 10Brouberol: Add the ability to create secrets containing S3 tokens (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [11:43:42] (03CR) 10Brouberol: [C:03+1] Create a new role for the dse-k8s nodes that are dedicated to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [11:45:09] (03PS27) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [11:45:57] 10ops-codfw, 06SRE, 06DC-Ops: lsw1-c1-codfw: PEM 0 loss of power - https://phabricator.wikimedia.org/T427057 (10cmooney) 03NEW p:05Triage→03High [11:46:03] (03CR) 10Gehel: [C:03+1] "LGTM. Inline note about duplication, but it should not be a blocker." 
[puppet] - 10https://gerrit.wikimedia.org/r/1290827 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [11:47:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:19] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:51:02] (03PS1) 10VadymTS1: Modify various configurations for English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) [12:01:16] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3073.esams.wmnet [12:01:16] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp307[2-3].esams.wmnet} and A:cp [12:03:45] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp308[0-1].esams.wmnet} and A:cp [12:05:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1058.eqiad.wmnet to cluster eqiad and group C [12:11:09] !log 
brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:11:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11948064 (10MoritzMuehlenhoff) [12:11:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:15:32] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3080.esams.wmnet [12:30:45] (03PS1) 10Brouberol: idp/idp_test: temporarily rollback growthbook(-next) access to nda/wmf [puppet] - 10https://gerrit.wikimedia.org/r/1291975 (https://phabricator.wikimedia.org/T420691) [12:31:14] (03CR) 10Bearloga: [C:03+1] idp/idp_test: temporarily rollback growthbook(-next) access to nda/wmf [puppet] - 10https://gerrit.wikimedia.org/r/1291975 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [12:31:27] (03CR) 10Brouberol: [C:03+2] idp/idp_test: temporarily rollback growthbook(-next) access to nda/wmf [puppet] - 10https://gerrit.wikimedia.org/r/1291975 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [12:37:02] (03PS3) 10CWilliams: sre.mysql.pool: Add support for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) [12:41:12] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:47:56] FIRING: ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:00] (03PS28) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:52:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:22] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:56:35] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1291981 [12:57:43] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp3081.esams.wmnet [12:57:43] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp308[0-1].esams.wmnet} and A:cp [12:59:06] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp5017.eqsin.wmnet} and A:cp [13:01:50] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1291981 (owner: 10Elukey) [13:02:23] (03CR) 10Muehlenhoff: [C:03+2] Update WDQS Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1282974 (owner: 10Muehlenhoff) [13:03:04] (03PS1) 10Elukey: Upstream release v12.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1291982 [13:03:21] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.6.0 [software/spicerack] (debian) - 
10https://gerrit.wikimedia.org/r/1291982 (owner: 10Elukey) [13:08:55] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for clouddb1017.eqiad.wmnet [13:09:07] !log uploaded spicerack_12.6.0 to apt.wikimedia.org bookworm-wikimedia [13:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:14] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:10:44] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet [13:11:34] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5017.eqsin.wmnet [13:11:34] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp5017.eqsin.wmnet} and A:cp [13:11:56] (03PS2) 10Btullis: Add the ability to create secrets containing S3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764) [13:12:40] (03CR) 10Btullis: Add the ability to create secrets containing S3 tokens (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [13:14:28] (03PS29) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [13:15:14] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:15:17] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [13:15:42] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for 6 hosts [13:15:59] !log bking@deploy1002 set search_codfw cluster recovery settings from 4 to 7 T426560 [13:16:04] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:36] (03CR) 10Hnowlan: [C:03+2] cdn: exempt performance from paging [alerts] - 10https://gerrit.wikimedia.org/r/1291896 (https://phabricator.wikimedia.org/T425299) (owner: 10Hnowlan) [13:18:33] (03PS30) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [13:18:41] FIRING: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [13:18:41] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:18:46] (03Merged) 10jenkins-bot: cdn: exempt performance from paging [alerts] - 10https://gerrit.wikimedia.org/r/1291896 (https://phabricator.wikimedia.org/T425299) (owner: 10Hnowlan) [13:20:01] !ack [13:20:02] All incidents are already acked. [13:20:49] fnegri@cumin1003 upgrade (PID 3284834) is awaiting input [13:20:49] ^^ is this expected? [13:21:01] fabfur: I don't think so [13:21:04] nothing in SAL [13:21:14] but, topranks was talking about some issues in codfw [13:21:22] topranks: are aware of any network issues in codfw? [13:21:23] it says Raft internal error in the logs [13:21:35] etcdserver: Request timed out [13:21:38] 07:46:32 < topranks> sry not now, yesterday afternoon looks like we lost power on one side to lsw1-c1-codfw [13:21:49] I am wondering if this is related in any way but not sure [13:21:51] sukhe: what are you doing here on a Friday? 
[13:21:58] topranks: :D [13:22:05] nah that's no issue really, switch lost one power supply but it's working fine on the other one [13:22:13] ok thanks and sorry for the noise [13:22:16] T427057 [13:22:16] T427057: lsw1-c1-codfw: PEM 0 loss of power - https://phabricator.wikimedia.org/T427057 [13:22:17] ok then, we need to look into it [13:23:03] https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication says [13:23:05] "In the event that etcdmirror fails (indicated by the EtcdReplicationDown alert), it should be safe to try restarting the systemd unit if logs suggest a transient issue - e.g., connectivity to the source cluster. " [13:23:21] I can try it [13:23:27] I'm on 2005 [13:23:31] ping me if it does seem network related [13:23:34] (conf2005) [13:23:39] yeah, let's give it a shot [13:23:49] {{ done}} [13:23:59] looks like it's replicating, from journal [13:24:14] all conf* nodes are up, so it appears to have been more of a connectivity issue? [13:24:53] seems to replicate fine again for now [13:24:54] I'll keep an eye on the logs [13:25:02] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for 6 hosts [13:25:18] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for clouddb1018.eqiad.wmnet [13:25:29] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for clouddb1018.eqiad.wmnet [13:25:36] (03Abandoned) 10Lerickson: Remove airflow-wikidata S3 credentials in "connections" and "extra_secrets". 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1290915 (https://phabricator.wikimedia.org/T426764) (owner: 10Lerickson) [13:28:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:28:41] RESOLVED: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [13:28:48] nice [13:31:06] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2100 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:32:25] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:16] (03PS1) 10Federico Ceratto: sre.mysql.upgrade: add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) [13:36:06] (03CR) 10Mpostoronca: [C:03+1] hCaptcha CommonSettings.php: Don't define sitekeys as config vars (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 (owner: 10Dreamy Jazz) [13:37:25] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:06] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2100 is OK: OK: Status of the 
systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:42:25] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:52] (03PS1) 10Elukey: WIP: sre.hosts.provision: introduce the wmfroot user [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) [13:46:15] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for clouddb1018.eqiad.wmnet [13:49:05] (03PS2) 10Elukey: WIP: sre.hosts.provision: introduce the wmfroot user [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) [13:49:39] (03PS1) 10Dreamy Jazz: Replace deprecated Hooks::getInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291996 (https://phabricator.wikimedia.org/T426981) [13:50:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [13:50:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [13:52:19] !log fnegri@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for clouddb1018.eqiad.wmnet [13:52:26] (03PS1) 10Andrew Bogott: Remove refs to cloudnet200[78]-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1291998 (https://phabricator.wikimedia.org/T427071) [13:53:26] !log andrew@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudnet2007-dev.codfw.wmnet [13:54:35] (03CR) 10CI reject: [V:04-1] Remove refs to cloudnet200[78]-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1291998 (https://phabricator.wikimedia.org/T427071) (owner: 10Andrew Bogott) [13:55:50] (03PS2) 10Andrew Bogott: Remove refs to cloudnet200[78]-dev.codfw.wmnet [puppet] - 
10https://gerrit.wikimedia.org/r/1291998 (https://phabricator.wikimedia.org/T427071) [13:56:55] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [13:57:12] (03PS4) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [13:57:13] (03PS1) 10FNegri: sre.mysql.upgrade: fix looping logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 [13:58:35] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [13:59:13] !log fnegri@cumin1003 START - Cookbook sre.mysql.upgrade for clouddb[1020,1022-1025].eqiad.wmnet [14:00:13] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudnet200[78]-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1291998 (https://phabricator.wikimedia.org/T427071) (owner: 10Andrew Bogott) [14:03:56] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet2007-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [14:04:21] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [14:04:40] (03PS1) 10Fabfur: hiera: disable cidergrinder (as emergency measure) [puppet] - 10https://gerrit.wikimedia.org/r/1292002 [14:05:10] (03CR) 10Fabfur: [C:04-1] "Gave -1 just to be sure no-one enables it if not strictly needed" [puppet] - 10https://gerrit.wikimedia.org/r/1292002 (owner: 10Fabfur) [14:06:25] 10ops-codfw, 06SRE, 06DC-Ops: lsw1-c1-codfw: PEM 0 loss of power - https://phabricator.wikimedia.org/T427057#11948444 (10RobH) a:03Jhancock.wm According to the calendar Papaul is out on vacation today, and loss of power redundancy is a pretty high priority so pinged Jenn in irc and assigning to her here. 
[14:07:01] andrew@cumin2002 decommission (PID 1402154) is awaiting input [14:07:33] (03CR) 10Brouberol: [C:03+1] Add the ability to create secrets containing S3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [14:08:02] (03CR) 10Btullis: [C:03+2] Add the ability to create secrets containing S3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [14:09:35] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:10:08] (03Merged) 10jenkins-bot: Add the ability to create secrets containing S3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1291930 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [14:10:27] (03CR) 10CI reject: [V:04-1] Modify various configurations for English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:14:25] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:20:59] (03PS2) 10VadymTS1: Modify various configurations for English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) [14:22:16] (03CR) 10VadymTS1: "Codenamenireste please start the recheck. 
I fix the code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:23:04] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet2007-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [14:23:05] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:23:06] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudnet2007-dev.codfw.wmnet [14:23:24] !log andrew@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudnet2008-dev.codfw.wmnet [14:23:35] (03CR) 10Codename Noreste: Modify various configurations for English Wikibooks (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:26:12] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:26:18] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:27:21] (03PS3) 10VadymTS1: Modify various configurations for English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) [14:27:27] (03CR) 10VadymTS1: Modify various configurations for English Wikibooks (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:28:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:29:44] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [14:30:29] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:32:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [14:33:40] !log fnegri@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for clouddb[1020,1022-1025].eqiad.wmnet [14:33:50] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet2008-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [14:34:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet2008-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin2002" [14:34:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:29] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudnet2008-dev.codfw.wmnet [14:34:49] (03PS4) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) [14:35:42] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudnet200[78]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T427071#11948531 (10Andrew) a:05Andrew→03None [14:35:46] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [14:39:23] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072 
(10MoritzMuehlenhoff) 03NEW [14:40:04] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2076 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:40:57] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: fasw2-c8a-codfw:xe-0/0/47 low RX power - https://phabricator.wikimedia.org/T426824#11948571 (10Jhancock.wm) 05Open→03Resolved [14:40:59] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11948572 (10MoritzMuehlenhoff) [14:41:29] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11948574 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:42:25] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:04] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2076 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:52:25] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:38] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:54:09] (03CR) 10Btullis: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [14:54:34] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:55:36] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:55:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:57:48] (03PS5) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) [14:59:34] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:59:38] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:00:54] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [15:01:34] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:01:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:02:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:02:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply [15:11:21] (03PS2) 10FNegri: sre.mysql.upgrade: fix looping logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 [15:11:21] (03PS5) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [15:12:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:54] (03CR) 10FNegri: sre.mysql.upgrade: support multiinstance hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [15:14:26] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: fix looping logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (owner: 10FNegri) [15:14:39] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [15:14:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:15:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:16:05] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11948684 (10MoritzMuehlenhoff) [15:18:00] (03PS1) 10Andrew Bogott: trove: install cumin key in new DB instances [puppet] - 10https://gerrit.wikimedia.org/r/1292028 (https://phabricator.wikimedia.org/T422801) [15:18:22] 10ops-codfw, 06SRE, 06DC-Ops: lsw1-c1-codfw: PEM 0 loss of power - https://phabricator.wikimedia.org/T427057#11948691 (10Jhancock.wm) 05Open→03Resolved reseated power cables. 
physical alert has cleared. [15:21:12] (03PS1) 10Btullis: [mediawiki-dumps-legacy] Replace hyphens with underscores in s3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292031 (https://phabricator.wikimedia.org/T426764) [15:21:22] (03CR) 10CI reject: [V:04-1] [mediawiki-dumps-legacy] Replace hyphens with underscores in s3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292031 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [15:21:23] (03PS2) 10Btullis: [mediawiki-dumps-legacy] Replace hyphens with underscores in s3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292031 (https://phabricator.wikimedia.org/T426764) [15:21:37] (03PS1) 10Dreamy Jazz: Grant globalblock-local-status to groups with globalblock-whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292032 (https://phabricator.wikimedia.org/T277942) [15:23:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292032 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [15:23:51] RESOLVED: NetworkDeviceAlarmActive: Alarm active on lsw1-c1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:24:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 (owner: 10Dreamy Jazz) [15:26:50] (03PS10) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 
(https://phabricator.wikimedia.org/T261192) [15:29:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:42] 06SRE, 06Traffic: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868#11948763 (10ssingh) 05Open→03Resolved a:03ssingh We have done quite a few reimages of durum since then (and reboots) and this issue was not observed. I am taking the liberty to close this as part... [15:31:31] (03PS6) 10Btullis: Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) [15:32:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2111:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:46] (03CR) 10Btullis: [C:03+2] [mediawiki-dumps-legacy] Replace hyphens with underscores in s3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292031 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [15:35:50] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [15:36:12] 06SRE, 06Traffic: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951#11948788 
(10ssingh) 05Open→03Resolved a:03ssingh There has been no follow-up to this in a while (and this is on k8s anyway now?) and this task has been open since 2024. I a... [15:36:14] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:37:04] (03Merged) 10jenkins-bot: [mediawiki-dumps-legacy] Replace hyphens with underscores in s3 tokens [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292031 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [15:37:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2111:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:53] (03PS1) 10JHathaway: Replace role::mariadb::ferm with profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) [15:39:15] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166#11948797 (10ssingh) 05Open→03Resolved a:03ssingh LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this. I am taking... 
[15:39:24] (03CR) 10CI reject: [V:04-1] Replace role::mariadb::ferm with profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:39:28] (03CR) 10Andrew Bogott: [C:03+2] trove: install cumin key in new DB instances [puppet] - 10https://gerrit.wikimedia.org/r/1292028 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [15:42:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2111:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:23] (03PS2) 10JHathaway: Replace role::mariadb::ferm with profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) [15:44:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291996 (https://phabricator.wikimedia.org/T426981) (owner: 10Dreamy Jazz) [15:44:58] (03PS1) 10Btullis: [mediawiki-dumps-legacy] Fix the issue with the s3 token name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292034 (https://phabricator.wikimedia.org/T426764) [15:45:08] (03PS2) 10Btullis: [mediawiki-dumps-legacy] Fix the issue with the s3 token name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292034 (https://phabricator.wikimedia.org/T426764) [15:46:14] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:46:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1292033 
(https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:48:19] 06SRE, 06Commons, 06Traffic: Backend fetch failed - https://phabricator.wikimedia.org/T383013#11948842 (10ssingh) 05Open→03Resolved a:03ssingh It seems like the issue was transient and therefore I am taking the liberty to close this as part of regular task cleanup. Please re-open if desired. [15:50:02] 06SRE, 10conftool, 06Traffic: confd causes soft lockup when you are tailing a file with -F and the state is updated - https://phabricator.wikimedia.org/T372646#11948872 (10ssingh) 05Open→03Resolved a:03ssingh No one else has observed this issue and it has been almost two years since this was report... [15:50:44] (03PS3) 10FNegri: sre.mysql.upgrade: fix looping logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) [15:50:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344#11948899 (10Jhancock.wm) a:03Jhancock.wm [15:53:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595#11948923 (10Jhancock.wm) a:03Jhancock.wm [15:54:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11948925 (10BTullis) Hello. Apologies for the trouble, but please could we rename these hosts before proceeding? * `wdqs2... [15:55:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11948933 (10BTullis) Hello. Apologies for the trouble, but please could we rename these hosts before proceeding? * `wdqs1036` -> `dse-k8s-wdqs1001` * `wdqs10... 
[15:57:26] 06SRE, 06Traffic: Investigate port 80 page in text@esams for Ipv6 - https://phabricator.wikimedia.org/T423667#11948936 (10ssingh) 05Open→03Declined This hasn't happened again and it's hard investigating now what caused these two blips. Boldly resolving for this as part of regular task cleanup. If it ha... [15:58:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11948939 (10BTullis) [15:58:40] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11948942 (10ssingh) @cmooney: We plan to move to Liberica in Q1 or Q2 of APP2026. Do you think we should still consider w... [15:58:50] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:00:51] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343#11948947 (10Jhancock.wm) a:03Jhancock.wm [16:00:53] 06SRE, 06Traffic: ATS automatically restarted due to receiving SIGUSR2 on cp5024 - https://phabricator.wikimedia.org/T344674#11948949 (10ssingh) 05Open→03Resolved a:03ssingh This hasn't happened in a while (last incident was 2023) and we have run `sre.cdn.roll-reboot` many times since then, so boldly... 
[16:02:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#11948956 (10BTullis) [16:02:25] (03PS3) 10JHathaway: Replace role::mariadb::ferm with profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) [16:03:12] (03CR) 10Btullis: [C:03+2] [mediawiki-dumps-legacy] Fix the issue with the s3 token name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292034 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [16:04:37] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11948966 (10cmooney) >>! In T405630#11948942, @ssingh wrote: > @cmooney: We plan to move to Liberica in Q1 or Q2 of FY202... [16:05:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:05:24] (03Merged) 10jenkins-bot: [mediawiki-dumps-legacy] Fix the issue with the s3 token name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292034 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [16:05:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:41] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11948969 (10ssingh) Thanks for the update and the explanation, @cmooney! 
[16:06:29] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341#11948984 (10Jhancock.wm) a:03Jhancock.wm [16:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:23] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2143.codfw.wmnet - https://phabricator.wikimedia.org/T424171#11948990 (10Jhancock.wm) a:03Jhancock.wm [16:12:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [16:12:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:14:03] (03CR) 10JHathaway: "Remove, rather than rename, role::mariadb::ferm" [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:14:42] (03CR) 10JHathaway: "Remove version, 1292033" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:15:32] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudnet200[78]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T427071#11948995 (10Jhancock.wm) a:03Jhancock.wm [16:16:16] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:17:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:03] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2013.codfw.wmnet - https://phabricator.wikimedia.org/T426555#11949003 (10Jhancock.wm) 05Open→03Resolved [16:20:25] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344#11949006 (10Jhancock.wm) 05Open→03Resolved [16:20:51] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595#11949020 (10Jhancock.wm) 05Open→03Resolved [16:21:12] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342#11949027 (10Jhancock.wm) 05Open→03Resolved [16:21:34] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343#11949030 (10Jhancock.wm) 05Open→03Resolved [16:21:57] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341#11949038 (10Jhancock.wm) 05Open→03Resolved [16:22:22] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2143.codfw.wmnet - https://phabricator.wikimedia.org/T424171#11949043 (10Jhancock.wm) 05Open→03Resolved [16:22:54] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudnet200[78]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T427071#11949047 (10Jhancock.wm) 05Open→03Resolved [16:23:22] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommision hosts cp2041 - cp2042 - 
https://phabricator.wikimedia.org/T426828#11949051 (10Jhancock.wm) 05Open→03Resolved [16:24:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:26] (03PS6) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [16:31:15] (03PS1) 10Btullis: [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) [16:31:24] (03PS2) 10Btullis: [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) [16:34:13] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:30] (03PS3) 10Btullis: [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) [16:34:41] (03PS4) 10Btullis: [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) [16:34:46] (03CR) 10CI reject: [V:04-1] [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [16:36:37] (03CR) 10Btullis: [C:03+2] Allow incoming traffic to port 7001 and 9999 for wdqs::alternatives [puppet] - 
10https://gerrit.wikimedia.org/r/1290771 (https://phabricator.wikimedia.org/T424865) (owner: 10Btullis) [16:40:29] (03CR) 10Btullis: [C:03+2] [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [16:41:35] (03PS1) 10Andrew Bogott: Revert "trove: install cumin key in new DB instances" [puppet] - 10https://gerrit.wikimedia.org/r/1292058 [16:41:52] (03PS7) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [16:41:56] FIRING: ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:36] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:40] (03Merged) 10jenkins-bot: [mediawiki-dumps-legacy] Fix the hyphens and underscores [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292050 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [16:43:30] (03CR) 10Andrew Bogott: [C:03+2] Revert "trove: install cumin key in new DB instances" [puppet] - 10https://gerrit.wikimedia.org/r/1292058 (owner: 10Andrew Bogott) [16:46:12] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2115 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:46:56] RESOLVED: ProbeDown: Service gitlab1004:443 has failed probes 
(http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:34] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [16:49:27] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8575/co" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [16:50:18] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:52:25] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:52:33] (03CR) 10Btullis: "This looks good. I wonder if we need to set some values for the presto cluster in test as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [16:56:12] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2115 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:07:30] (03PS8) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [17:12:48] (03PS6) 10Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 [17:16:57] !enable ttl protection on drmrs ibgp link between CRs [17:18:44] (03CR) 10Dreamy Jazz: hCaptcha CommonSettings.php: Don't define sitekeys as config vars (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290964 (owner: 10Dreamy Jazz) [17:20:52] (03PS9) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [17:21:10] topranks: missing !log [17:22:00] taavi: ah yes, it's ok it was either gonna break everything right away or work, and thankfully worked (I had labbed it up but you know) [17:22:05] well spotted thanks <3 [17:26:28] (03PS2) 10Tiziano Fogli: performance.w.o: add http blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1291950 (https://phabricator.wikimedia.org/T425299) [17:27:36] (03PS10) 10FNegri: sre.mysql.upgrade: support multiinstance hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [17:28:53] !log enable ttl protection on ulsfo CRs IBGP session [17:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:23] (03CR) 10FNegri: "Added test coverage, the difference in the output looks reasonable, it's mostly adding new step() log messages." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [17:34:55] !log enable ttl protection on esams CRs IBGP session [17:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:30] (03CR) 10FNegri: "Well, not really "added", for now I just adapted the existing tests that are covering the single-instance case." [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [17:56:12] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2084 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:57:25] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:01:06] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2108 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:02:25] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:44] (03PS2) 10Jasmine: kafka-main2006: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288917 (https://phabricator.wikimedia.org/T427088) [18:06:12] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2084 is OK: OK: Status of the systemd unit 
push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:07:01] (03PS2) 10Jasmine: kafka-main2007: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288918 (https://phabricator.wikimedia.org/T427088) [18:07:25] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:02] (03PS2) 10Jasmine: kafka-main2008: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288919 (https://phabricator.wikimedia.org/T427088) [18:08:53] (03PS2) 10Jasmine: kafka-main2009: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288920 (https://phabricator.wikimedia.org/T427088) [18:09:46] (03PS2) 10Jasmine: kafka-main2010: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) [18:10:46] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW - https://phabricator.wikimedia.org/T420706#11949332 (10cmooney) Nokia have told us they are going to fix this and the patch is scheduled for release 26.7.1 which should be out late July/August.
[18:11:06] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2108 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:17:07] (03PS1) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1292091 [18:24:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:29:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:39:14] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [18:52:21] (03PS1) 10Dzahn: contint: disable jenkins on legacy CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) [18:52:32] (03PS2) 10Dzahn: contint: disable jenkins on legacy CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) [18:52:53] (03CR) 10Dzahn: contint: disable jenkins on legacy CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:53:24] (03PS2) 10Dzahn: jenkins: add firewall rule for new jenkins to gearman on legacy host [puppet] - 10https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) [18:54:47] (03PS2) 10Dzahn: ci: switch jenkins proxy target to new discovery name [puppet] - 
10https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) [18:54:54] (03PS3) 10Dzahn: ci: switch jenkins proxy target to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) [19:10:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:15:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:25:09] (03CR) 10DLynch: "This caused T427066." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [19:30:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:42:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed