[00:03:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:24] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:32] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:44] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1149515
[00:08:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1149515 (owner: TrainBranchBot)
[00:09:24] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:26:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:28:20] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:29:44] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1149515 (owner: TrainBranchBot)
[00:31:16] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53942 bytes in 5.022 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:31:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:44:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/01c301b6a6ce2fdf1479f0bbef44ebac3b7144826f5cc0f0e1e0faf3c58880ef/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:46:16] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[01:04:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:48:37]
FIRING: [3x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:56:54] (PS1) Raymond Ndibe: toolforge:prometheus:: add components-api scrape endpoint [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276)
[01:59:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:41:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[03:57:36] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1090 to cirrussearch1090
[03:57:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1089.eqiad.wmnet with OS bullseye
[03:57:48] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[03:57:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host
cirrussearch1089
[03:57:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1089
[04:03:24] ryankemper@cumin2002 rename (PID 271136) is awaiting input
[04:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:04] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1090 to cirrussearch1090 - ryankemper@cumin2002"
[04:07:09] ryankemper@cumin2002 rename (PID 271136) is awaiting input
[04:07:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1090 to cirrussearch1090 - ryankemper@cumin2002"
[04:07:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:07:23] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1090 on all recursors
[04:07:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1090 on all recursors
[04:07:27] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1090
[04:08:32] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:31] ryankemper@cumin2002 rename (PID 271136) is awaiting input
[04:10:40] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1090
[04:11:20] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1090 to cirrussearch1090
[04:16:20] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1089.eqiad.wmnet with reason: host reimage
[04:20:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1089.eqiad.wmnet with reason: host reimage
[04:21:13] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1090.eqiad.wmnet on all recursors
[04:21:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1090.eqiad.wmnet on all recursors
[04:21:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1090.eqiad.wmnet with OS bullseye
[04:21:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1090
[04:21:50] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1090
[04:35:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1090.eqiad.wmnet with reason: host reimage
[04:39:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1090.eqiad.wmnet with reason: host reimage
[04:46:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1089.eqiad.wmnet with OS bullseye
[05:06:32] (CR) Marostegui: "Please test again to see if it is fixed."
[cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:26] (PS1) Marostegui: instances.yaml: Remove db1183 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1149536 (https://phabricator.wikimedia.org/T394507)
[05:12:07] (CR) Marostegui: [C:+2] instances.yaml: Remove db1183 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1149536 (https://phabricator.wikimedia.org/T394507) (owner: Marostegui)
[05:13:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1183 from dbctl T394507', diff saved to https://phabricator.wikimedia.org/P76410 and previous config saved to /var/cache/conftool/dbconfig/20250523-051339-marostegui.json
[05:13:43] T394507: decommission db1183 - https://phabricator.wikimedia.org/T394507
[05:15:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1090.eqiad.wmnet with OS bullseye
[05:16:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1091 to cirrussearch1091
[05:16:57] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[05:22:34] ryankemper@cumin2002 rename (PID 308701) is awaiting input
[05:23:16] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1091 to cirrussearch1091 - ryankemper@cumin2002"
[05:24:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1091 to cirrussearch1091 - ryankemper@cumin2002"
[05:24:55] !log ryankemper@cumin2002 END (PASS)
- Cookbook sre.dns.netbox (exit_code=0)
[05:24:55] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1091 on all recursors
[05:24:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1091 on all recursors
[05:25:00] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1091
[05:25:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1092 to cirrussearch1092
[05:25:13] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[05:28:03] ryankemper@cumin2002 rename (PID 308701) is awaiting input
[05:28:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1091
[05:28:41] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1092 to cirrussearch1092 - ryankemper@cumin2002"
[05:28:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1091 to cirrussearch1091
[05:31:47] ryankemper@cumin2002 rename (PID 311165) is awaiting input
[05:32:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1092 to cirrussearch1092 - ryankemper@cumin2002"
[05:32:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:32:02] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1092 on all recursors
[05:32:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1092 on all recursors
[05:32:06] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1092
[05:32:42] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host
cirrussearch1092
[05:33:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1092 to cirrussearch1092
[05:38:18] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1092.eqiad.wmnet on all recursors
[05:38:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1092.eqiad.wmnet on all recursors
[05:38:24] (PS1) Marostegui: mariadb: Decommission db1183 [puppet] - https://gerrit.wikimedia.org/r/1149538 (https://phabricator.wikimedia.org/T394507)
[05:38:26] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1091.eqiad.wmnet on all recursors
[05:38:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1091.eqiad.wmnet on all recursors
[05:38:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1092.eqiad.wmnet with OS bullseye
[05:39:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1092
[05:39:00] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1092
[05:39:01] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1091.eqiad.wmnet with OS bullseye
[05:39:05] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1091
[05:39:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1091
[05:39:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1183.eqiad.wmnet
[05:39:46] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 2 others: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10851232 (ABran-WMF)
[05:39:58] (CR) Marostegui: [C:+2] mariadb: Decommission db1183 [puppet] - https://gerrit.wikimedia.org/r/1149538 (https://phabricator.wikimedia.org/T394507)
(owner: Marostegui)
[05:46:14] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[05:48:37] FIRING: [3x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:49:25] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1183.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[05:49:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1183.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[05:49:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:49:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1183.eqiad.wmnet
[05:50:15] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission db1183 - https://phabricator.wikimedia.org/T394507#10851251 (Marostegui) a:Marostegui→None
[05:50:23] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission db1183 - https://phabricator.wikimedia.org/T394507#10851256 (Marostegui) Ready for DC-Ops
[05:52:56] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1091.eqiad.wmnet with reason: host reimage
[05:55:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1091.eqiad.wmnet with reason: host reimage
[05:56:05] (CR) Arnaudb: [C:+1] "thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1149488 (https://phabricator.wikimedia.org/T393723) (owner: Dzahn)
[05:56:24] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1092.eqiad.wmnet with reason: host reimage
[05:59:31] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1092.eqiad.wmnet with reason: host reimage
[05:59:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250523T0600)
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:34] (PS1) Marostegui: wmnet: Add pc8-master CNAME [dns] - https://gerrit.wikimedia.org/r/1149539 (https://phabricator.wikimedia.org/T394260)
[06:17:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1092.eqiad.wmnet with OS bullseye
[06:17:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[06:18:38] (CR) Marostegui: [C:+2] wmnet: Add pc8-master CNAME [dns] - https://gerrit.wikimedia.org/r/1149539
(https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[06:18:41] !log marostegui@dns1006 START - running authdns-update
[06:19:25] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1094 to cirrussearch1094
[06:19:29] !log marostegui@dns1006 END - running authdns-update
[06:19:39] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[06:23:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1091.eqiad.wmnet with OS bullseye
[06:24:01] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1094 to cirrussearch1094 - ryankemper@cumin2002"
[06:24:21] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1093 to cirrussearch1093
[06:24:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1094 to cirrussearch1094 - ryankemper@cumin2002"
[06:24:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:24:27] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1094 on all recursors
[06:24:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1094 on all recursors
[06:24:31] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1094
[06:24:34] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[06:26:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[06:26:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1094
[06:27:35] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1094 to cirrussearch1094
[06:28:16]
!log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1093 to cirrussearch1093 - ryankemper@cumin2002"
[06:29:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[06:30:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1093 to cirrussearch1093 - ryankemper@cumin2002"
[06:30:50] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:30:50] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1093 on all recursors
[06:30:53] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1093 on all recursors
[06:30:54] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1093
[06:32:07] PROBLEM - OpenSearch unassigned shard check - 9200 on relforge1010 is CRITICAL: CRITICAL - .kibana_1[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:50.630Z), frwiki_content[0](2025-05-19T22:19:50.631Z), frwiki_content[0](2025-05-19T22:19:28.041Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:32:07] PROBLEM - OpenSearch unassigned shard check - 9200 on relforge1009 is CRITICAL: CRITICAL - frwiki_content[0](2025-05-19T22:19:50.631Z), frwiki_content[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:50.630Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:32:07] PROBLEM - OpenSearch unassigned shard check - 9200 on relforge1008 is CRITICAL: CRITICAL - frwiki_content[0](2025-05-19T22:19:50.631Z), frwiki_content[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:50.630Z)
https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:33:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[06:33:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[06:33:58] ryankemper@cumin2002 rename (PID 342766) is awaiting input
[06:34:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1093
[06:34:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[06:34:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1093 to cirrussearch1093
[06:34:59] ACKNOWLEDGEMENT - OpenSearch unassigned shard check - 9200 on relforge1008 is CRITICAL: CRITICAL - frwiki_content[0](2025-05-19T22:19:50.631Z), frwiki_content[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:50.630Z) Ryan Kemper this can wait till tomorrow https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:34:59] ACKNOWLEDGEMENT - OpenSearch unassigned shard check - 9200 on relforge1009 is CRITICAL: CRITICAL - frwiki_content[0](2025-05-19T22:19:50.631Z), frwiki_content[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:50.630Z) Ryan Kemper this can wait till tomorrow https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:34:59] ACKNOWLEDGEMENT - OpenSearch unassigned shard check - 9200 on relforge1010 is CRITICAL: CRITICAL - .kibana_1[0](2025-05-19T22:19:28.041Z), .kibana_1[0](2025-05-19T22:19:50.630Z), frwiki_content[0](2025-05-19T22:19:50.631Z), frwiki_content[0](2025-05-19T22:19:28.041Z) Ryan Kemper this can wait till tomorrow https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:35:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s_services/services/datahub: sync on production
[06:35:57] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1093.eqiad.wmnet with OS bullseye
[06:36:01] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1093
[06:36:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1093
[06:36:06] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1094.eqiad.wmnet with OS bullseye
[06:36:10] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1094
[06:36:10] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1094
[06:36:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[06:38:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[06:47:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[06:48:03] (PS1) Muehlenhoff: Switch krb1001 to insetup role [puppet] - https://gerrit.wikimedia.org/r/1149540 (https://phabricator.wikimedia.org/T390863)
[06:48:32] FIRING: [4x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:48:41] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1149488 (https://phabricator.wikimedia.org/T393723) (owner: Dzahn)
[06:50:46] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1093.eqiad.wmnet with reason: host reimage
[06:51:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START
helmfile.d/dse-k8s_services/services/datahub: apply on production
[06:53:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[06:54:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1093.eqiad.wmnet with reason: host reimage
[06:55:30] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1094.eqiad.wmnet with reason: host reimage
[06:56:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:00:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250523T0700)
[07:00:19] (CR) Muehlenhoff: [C:+2] Switch krb1001 to insetup role [puppet] - https://gerrit.wikimedia.org/r/1149540 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[07:01:08] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[07:02:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1094.eqiad.wmnet with reason: host reimage
[07:03:32] FIRING: [4x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:07:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:10:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:11:07] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:11:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:12:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:14:24] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:15:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:16:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[07:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm -
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:21:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart
[07:21:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:22:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:23:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1093.eqiad.wmnet with OS bullseye
[07:23:30] (CR) Federico Ceratto: "Started another test run - the logging into Phabricator is showing the number: https://phabricator.wikimedia.org/T363665#10851328" [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[07:25:44] (CR) Marostegui: "I think it is more useful if it shows the hostname and the instance restarted. Is that doable?"
[cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[07:25:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:26:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0)
[07:26:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:26:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:26:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:28:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1094.eqiad.wmnet with OS bullseye
[07:30:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[07:32:01] (PS1) Muehlenhoff: Default the Kerberos role to nftables [puppet] - https://gerrit.wikimedia.org/r/1149542 (https://phabricator.wikimedia.org/T390863)
[07:32:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[07:33:32] (CR) Federico Ceratto: "Yes - do we want a Phab update for each instance restart or one for each host?" [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[07:34:14] (CR) Marostegui: "This shouldn't be a common operation, so I think each host and instance is good for now. We can be verbose about this."
[cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [07:35:27] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sde) failed in moss-be1002 - https://phabricator.wikimedia.org/T395103 (10MatthewVernon) 03NEW [07:35:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Disk (sde) failed in moss-be1002 - https://phabricator.wikimedia.org/T395103#10851369 (10MatthewVernon) p:05Triage→03High [07:36:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1108 to cirrussearch1108 [07:37:07] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [07:38:50] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1095 to cirrussearch1095 [07:41:31] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1108 to cirrussearch1108 - ryankemper@cumin2002" [07:41:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1108 to cirrussearch1108 - ryankemper@cumin2002" [07:41:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:41:49] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1108 on all recursors [07:41:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1108 on all recursors [07:41:53] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1108 [07:42:05] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [07:42:06] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1108 [07:42:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1108 to 
cirrussearch1108 [07:42:47] (03PS1) 10Brouberol: datahub: make nocode migration job resources configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149543 [07:44:26] (03CR) 10Joal: [C:03+1] datahub: make nocode migration job resources configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149543 (owner: 10Brouberol) [07:45:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [07:45:27] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1095 to cirrussearch1095 - ryankemper@cumin2002" [07:45:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1095 to cirrussearch1095 - ryankemper@cumin2002" [07:45:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:45:33] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1095 on all recursors [07:45:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1095 on all recursors [07:45:37] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1095 [07:45:52] (03PS2) 10Brouberol: datahub: make nocode migration job resources configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149543 [07:46:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [07:48:29] (03CR) 10Brouberol: [C:03+2] datahub: make nocode migration job resources configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149543 (owner: 10Brouberol) [07:48:32] FIRING: [4x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:48:40] ryankemper@cumin2002 rename (PID 380448) is awaiting input [07:49:31] (03PS1) 10Muehlenhoff: Fix auto restart for alertmanager-irc-relay [puppet] - 10https://gerrit.wikimedia.org/r/1149544 [07:49:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149544 (owner: 10Muehlenhoff) [07:49:53] (03CR) 10Btullis: [C:03+1] datahub: make nocode migration job resources configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149543 (owner: 10Brouberol) [07:50:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [07:50:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [07:50:48] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1095 [07:51:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1095 to cirrussearch1095 [07:51:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [07:53:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [07:55:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [07:56:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [07:57:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1108.eqiad.wmnet with OS bullseye [07:57:53] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1095.eqiad.wmnet with OS bullseye [07:57:54] !log 
ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1108 [07:57:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1108 [07:57:57] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1095 [07:57:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1095 [07:58:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [07:59:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [08:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [08:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:47] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1095.eqiad.wmnet with reason: host reimage [08:12:57] (03PS1) 10Brouberol: datahub: increase resources accross the board [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149602 [08:14:47] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1108.eqiad.wmnet with reason: host reimage [08:14:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851408 
(10MatthewVernon) @Jclark-ctr the problem (at least on thanos-be1006 where I started) is that the disks aren't visible to the operating sys... [08:15:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1095.eqiad.wmnet with reason: host reimage [08:15:41] (03CR) 10Btullis: [C:03+1] datahub: increase resources accross the board [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149602 (owner: 10Brouberol) [08:15:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:17:35] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1109 to cirrussearch1109 [08:17:47] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [08:18:32] FIRING: [3x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:19:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1108.eqiad.wmnet with reason: host reimage [08:19:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [08:20:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:21:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:21:21] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1109 to cirrussearch1109 - ryankemper@cumin2002" [08:21:31] (03CR) 10Brouberol: [C:03+2] datahub: increase resources accross the board [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149602 (owner: 10Brouberol) [08:24:05] !log 
ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1109 to cirrussearch1109 - ryankemper@cumin2002" [08:24:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:24:05] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1109 on all recursors [08:24:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1109 on all recursors [08:24:09] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1109 [08:24:20] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1109 [08:24:24] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [08:24:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1109 to cirrussearch1109 [08:25:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:27:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:27:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:29:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:29:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:31:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:32:56] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1109.eqiad.wmnet with OS bullseye [08:33:00] !log ryankemper@cumin2002 
START - Cookbook sre.hosts.move-vlan for host cirrussearch1109 [08:33:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1109 [08:33:03] (03PS1) 10Majavah: conftool-data: Add x3 wiki replica backend services [puppet] - 10https://gerrit.wikimedia.org/r/1149603 (https://phabricator.wikimedia.org/T390954) [08:33:04] (03PS13) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [08:33:04] (03PS1) 10Majavah: P:wmcs::cloudlb: Add x3 wiki replica backend service [puppet] - 10https://gerrit.wikimedia.org/r/1149604 (https://phabricator.wikimedia.org/T390954) [08:33:06] (03PS1) 10Majavah: hieradata: cloudlb: Move x3 VIP to new x3 backend [puppet] - 10https://gerrit.wikimedia.org/r/1149605 (https://phabricator.wikimedia.org/T390954) [08:35:28] (03PS1) 10Majavah: definitions: Add port for x3 wiki replica backend [homer/public] - 10https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) [08:39:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1095.eqiad.wmnet with OS bullseye [08:40:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:42:12] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1006.eqiad.wmnet with OS bullseye [08:42:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851473 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-be1006.eqiad.wmnet with OS bul... 
[08:42:37] (03PS1) 10Brouberol: datahub-next: increase resources accross the board [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149608 (https://phabricator.wikimedia.org/T395057) [08:43:21] (03CR) 10Btullis: [C:03+1] datahub-next: increase resources accross the board [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149608 (https://phabricator.wikimedia.org/T395057) (owner: 10Brouberol) [08:44:07] (03CR) 10Brouberol: [C:03+2] datahub-next: increase resources accross the board [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149608 (https://phabricator.wikimedia.org/T395057) (owner: 10Brouberol) [08:44:56] (03PS1) 10JMeybohm: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149609 (https://phabricator.wikimedia.org/T378429) [08:49:07] (03CR) 10Federico Ceratto: "Updated and did another run." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [08:49:16] (03PS1) 10FNegri: Revert "Failover all dumps traffic to clouddumps1002" [puppet] - 10https://gerrit.wikimedia.org/r/1149610 [08:50:11] (03CR) 10Marostegui: "The last !log now shows: https://phabricator.wikimedia.org/T363665#10851443" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [08:51:43] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1108.eqiad.wmnet with OS bullseye [08:52:41] (03CR) 10Clément Goubert: [C:03+1] deployment_server: Call into the mwscript helper from mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148490 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [08:53:33] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1109.eqiad.wmnet with reason: host reimage [08:56:03] (03CR) 10Majavah: [C:03+1] Revert "Failover all dumps traffic to clouddumps1002" 
[puppet] - 10https://gerrit.wikimedia.org/r/1149610 (owner: 10FNegri) [08:56:17] (03CR) 10FNegri: [C:03+2] Revert "Failover all dumps traffic to clouddumps1002" [puppet] - 10https://gerrit.wikimedia.org/r/1149610 (owner: 10FNegri) [08:57:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1109.eqiad.wmnet with reason: host reimage [09:01:51] (03CR) 10JMeybohm: [C:03+2] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149609 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [09:02:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10851565 (10fnegri) I reverted my [change from last month](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131051) and moved bac... [09:05:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [09:08:22] (03Merged) 10jenkins-bot: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149609 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [09:10:12] (03CR) 10Tiziano Fogli: [C:03+1] Fix auto restart for alertmanager-irc-relay [puppet] - 10https://gerrit.wikimedia.org/r/1149544 (owner: 10Muehlenhoff) [09:10:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [09:10:33] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1006.eqiad.wmnet with reason: host reimage [09:14:20] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1006.eqiad.wmnet with reason: host reimage [09:18:50] (03PS1) 10Marostegui: es2035: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1149612 (https://phabricator.wikimedia.org/T394469) [09:18:54] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2035 T394469', diff saved to https://phabricator.wikimedia.org/P76411 and previous config saved to /var/cache/conftool/dbconfig/20250523-091853-marostegui.json [09:18:58] T394469: Migrate es6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T394469 [09:19:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [09:19:23] (03PS1) 10Clément Goubert: mediawiki: Add netpol for prometheus HTTP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) [09:19:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2035.codfw.wmnet with reason: Maintenance [09:20:37] (03CR) 10CI reject: [V:04-1] mediawiki: Add netpol for prometheus HTTP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:21:10] (03CR) 10Marostegui: [C:03+2] es2035: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1149612 (https://phabricator.wikimedia.org/T394469) (owner: 10Marostegui) [09:23:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1109.eqiad.wmnet with OS bullseye [09:25:17] (03PS2) 10Clément Goubert: mediawiki: Add netpol for prometheus HTTP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) [09:27:14] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10851628 (10tappof) Sure @RobH, nothing will catch fire because of this patch (or maybe it will, since we’re talking about elec... 
[09:27:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:27:44] (03PS1) 10Clément Goubert: mw::maintenance::cirrussearch: Skip s8 [puppet] - 10https://gerrit.wikimedia.org/r/1149614 (https://phabricator.wikimedia.org/T388538) [09:27:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [09:28:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:28:51] (03CR) 10CI reject: [V:04-1] mw::maintenance::cirrussearch: Skip s8 [puppet] - 10https://gerrit.wikimedia.org/r/1149614 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:30:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76412 and previous config saved to /var/cache/conftool/dbconfig/20250523-093015-root.json [09:31:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [09:31:42] (03CR) 10Hnowlan: mediawiki: Add netpol for prometheus HTTP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:32:02] (03PS1) 10Brouberol: airflow: grant analytics airflow instances access to schema.deployment.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149617 (https://phabricator.wikimedia.org/T392668) [09:32:16] (03CR) 10Clément Goubert: mediawiki: Add netpol for prometheus HTTP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément 
Goubert) [09:34:21] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:34:40] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:34:58] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for lsobanski - https://phabricator.wikimedia.org/T395110 (10LSobanski) 03NEW [09:36:14] (03PS2) 10Brouberol: airflow: grant airflow instances access to schema.deployment.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149617 (https://phabricator.wikimedia.org/T392668) [09:36:28] (03CR) 10Joal: [C:03+1] "Thank you :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149617 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [09:37:16] (03PS3) 10Clément Goubert: mediawiki: Add netpol for prometheus HTTP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) [09:40:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [09:40:52] (03PS1) 10MVernon: Set new-style storage for new thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/1149619 (https://phabricator.wikimedia.org/T392908) [09:41:35] (03CR) 10Hnowlan: [C:03+1] mediawiki: Add netpol for prometheus HTTP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:42:48] (03CR) 10Brouberol: [C:03+2] airflow: grant airflow instances access to schema.deployment.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149617 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [09:44:57] (03Abandoned) 10Aqu: airflow-analytics-test: Temporarily Disable DataHub plugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149456 (owner: 10Aqu) [09:45:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 20%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76413 and previous config saved to /var/cache/conftool/dbconfig/20250523-094520-root.json [09:47:58] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Add netpol for prometheus HTTP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149613 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:50:12] !log cgoubert@deploy1003 Started scap sync-world: 1149613: mediawiki: Add netpol for prometheus HTTP - T388538 [09:50:16] T388538: Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538 [09:52:16] !log cgoubert@deploy1003 Finished scap sync-world: 1149613: mediawiki: Add netpol for prometheus HTTP - T388538 (duration: 03m 11s) [09:52:25] !log isaranto@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: sync [09:52:38] !log isaranto@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [09:55:57] (03CR) 10Clément Goubert: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1149614 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:57:20] (03PS1) 10Clément Goubert: Revert^2 "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149623 (https://phabricator.wikimedia.org/T388538) [09:58:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:58:32] (03CR) 10Hnowlan: [C:03+1] Revert^2 "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149623 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [09:58:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:58:37] (03PS2) 10Clément Goubert: mw::maintenance::cirrussearch: Skip s8 [puppet] - 10https://gerrit.wikimedia.org/r/1149614 (https://phabricator.wikimedia.org/T388538) [09:58:49] (03PS1) 
10Hnowlan: trafficserver: restbaseless reading lists API for ~group0 [puppet] - 10https://gerrit.wikimedia.org/r/1149624 (https://phabricator.wikimedia.org/T384891) [09:58:52] (03PS1) 10Hnowlan: trafficserver: restbaseless reading lists API for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1149625 (https://phabricator.wikimedia.org/T384891) [09:59:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:00:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76414 and previous config saved to /var/cache/conftool/dbconfig/20250523-100026-root.json [10:00:40] (03CR) 10Clément Goubert: [C:03+2] Revert^2 "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1149623 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [10:00:51] (03CR) 10Federico Ceratto: [C:03+1] "I've done a quick review basic syntax checking." 
[puppet] - 10https://gerrit.wikimedia.org/r/1149619 (https://phabricator.wikimedia.org/T392908) (owner: 10MVernon) [10:01:34] !log isaranto@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:01:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:01:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:02:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:02:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:02:57] !log isaranto@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:03:13] (03CR) 10MVernon: [C:03+2] Set new-style storage for new thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/1149619 (https://phabricator.wikimedia.org/T392908) (owner: 10MVernon) [10:03:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:04:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:04:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:05:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:05:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:06:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:07:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:07:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:07:05] (03CR) 10DCausse: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1149614 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [10:07:51] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:08:30] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:43] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [10:08:51] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [10:09:24] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:10:24] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:11:04] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::cirrussearch: Skip s8 [puppet] - 10https://gerrit.wikimedia.org/r/1149614 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [10:12:03] (03PS2) 10Hnowlan: trafficserver: restbaseless reading lists API for ~group1 [puppet] - 10https://gerrit.wikimedia.org/r/1149624 (https://phabricator.wikimedia.org/T384891) [10:12:31] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate remaining translation notifications jobs [puppet] - 10https://gerrit.wikimedia.org/r/1149426 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [10:14:44] !log isaranto@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:15:06] !log isaranto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:15:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76415 and previous config saved to /var/cache/conftool/dbconfig/20250523-101532-root.json [10:15:58] (03PS1) 10Brouberol: airflow: deploy a tiny toolbox allowing users to test task networking [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149628 (https://phabricator.wikimedia.org/T392668) [10:16:00] hnowlan: my puppet run on deploy pulled your lpl cronjob changes [10:16:09] so i'll deploy them with the cirrus one [10:16:20] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:16:57] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:18:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [10:18:19] claime: thanks [10:18:51] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [10:18:56] hnowlan: deployed [10:19:23] (03PS2) 10Brouberol: airflow: deploy a tiny toolbox allowing users to test task networking [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149628 (https://phabricator.wikimedia.org/T392668) [10:21:06] (03CR) 10Arnaudb: [C:03+2] admin: add jdlrobson to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1149488 (https://phabricator.wikimedia.org/T393723) (owner: 10Dzahn) [10:21:27] cool [10:21:56] mvernon@cumin1002 reimage (PID 1162774) is awaiting input [10:24:22] (03CR) 10Btullis: [C:03+1] airflow: deploy a tiny toolbox allowing users to test task networking [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149628 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [10:24:44] (03PS1) 10Clément Goubert: mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist [puppet] - 10https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) [10:25:23] (03CR) 10Brouberol: [C:03+2] airflow: deploy a tiny toolbox allowing users to test task networking [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149628 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [10:27:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:27:55] (03PS1) 10Majavah: P:openstack: Move away from manual @resolve syntax [puppet] - 10https://gerrit.wikimedia.org/r/1149630 [10:27:55] (03PS1) 10Majavah: P:openstack: Pass ports as numbers to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1149631 [10:28:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:29:46] (03PS1) 10Gkyziridis: ml-services: edit-check latest 
image deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149632 (https://phabricator.wikimedia.org/T394779) [10:29:56] (03PS1) 10Effie Mouzeli: WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 [10:30:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76417 and previous config saved to /var/cache/conftool/dbconfig/20250523-103038-root.json [10:31:02] !log importing ferm 2.5.1-4+wmf13u1 T391083 [10:31:03] (03PS1) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) [10:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:07] (03CR) 10CI reject: [V:04-1] WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (owner: 10Effie Mouzeli) [10:31:08] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [10:31:42] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5670/co" [puppet] - 10https://gerrit.wikimedia.org/r/1149631 (owner: 10Majavah) [10:32:22] (03PS2) 10Effie Mouzeli: WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 [10:33:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1149631 (owner: 10Majavah) [10:33:33] (03CR) 10CI reject: [V:04-1] WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (owner: 10Effie Mouzeli) [10:33:56] (03CR) 10JMeybohm: "Nice one!" [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [10:35:04] (03CR) 10Gkyziridis: [C:03+1] "LGTM! 
Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [10:35:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5671/co" [puppet] - 10https://gerrit.wikimedia.org/r/1149630 (owner: 10Majavah) [10:35:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [10:35:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1006.eqiad.wmnet with OS bullseye [10:35:48] (03CR) 10Hnowlan: [C:03+1] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist [puppet] - 10https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [10:35:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-be1006.eqiad.wmnet with OS bullsey... [10:36:10] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist [puppet] - 10https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [10:36:26] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1007.eqiad.wmnet with OS bullseye [10:36:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-be1007.eqiad.wmnet with OS bul... 
[10:37:57] (03PS3) 10Effie Mouzeli: WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 [10:39:14] (03CR) 10CI reject: [V:04-1] WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (owner: 10Effie Mouzeli) [10:42:02] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:42:24] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:42:27] (03PS6) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [10:45:37] !log Manual run of purge-securepollvotedata - T388542 [10:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:42] T388542: Migrate trust_and_safety_product_team jobs to mw-cron - https://phabricator.wikimedia.org/T388542 [10:45:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76418 and previous config saved to /var/cache/conftool/dbconfig/20250523-104543-root.json [10:46:40] (03PS1) 10Brouberol: Enable talking to schema.discovery.wmnet via the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149638 (https://phabricator.wikimedia.org/T392668) [10:47:08] (03CR) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:47:19] (03CR) 10Ilias Sarantopoulos: "@tklausmann@wikimedia.org could you review and merge please? I don't have +2 access." 
[puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [10:47:59] (03PS7) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [10:48:10] (03PS1) 10Brouberol: airflow: disable hardcoded networkpolicy in favor of the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149639 (https://phabricator.wikimedia.org/T392668) [10:50:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10851841 (10MoritzMuehlenhoff) [10:51:05] (03CR) 10Ilias Sarantopoulos: ml-services: edit-check latest image deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149632 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [10:51:59] (03CR) 10Gkyziridis: [C:03+1] httpbb(liftwing): add edit-check tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [10:52:27] (03CR) 10Btullis: [C:03+1] Enable talking to schema.discovery.wmnet via the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149638 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [10:52:41] (03CR) 10Btullis: [C:03+1] airflow: disable hardcoded networkpolicy in favor of the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149639 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [10:53:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1149630 (owner: 10Majavah) [10:53:59] (03PS8) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [10:55:09] (03CR) 10Joal: [C:03+1] Enable talking to 
schema.discovery.wmnet via the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149638 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [10:55:35] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2004.codfw.wmnet [10:55:36] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2004.codfw.wmnet [10:56:08] (03PS2) 10Gkyziridis: ml-services: edit-check latest image deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149632 (https://phabricator.wikimedia.org/T394779) [10:56:19] (03CR) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [10:56:54] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2004.codfw.wmnet [10:57:20] (03CR) 10Joal: [C:03+1] airflow: disable hardcoded networkpolicy in favor of the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149639 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [10:57:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2004.codfw.wmnet [10:58:00] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: Move away from manual @resolve syntax [puppet] - 10https://gerrit.wikimedia.org/r/1149630 (owner: 10Majavah) [10:59:46] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1007.eqiad.wmnet with reason: host reimage [10:59:47] (03PS2) 10Majavah: P:openstack: Pass ports as numbers to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1149631 [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250523T0700) [11:00:05] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:00:05] jelto, arnoldokoth, and mutante: May I have your attention please! GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250523T1100) [11:00:25] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:01:15] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1008.eqiad.wmnet with OS bullseye [11:01:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-be1008.eqiad.wmnet with OS bul... [11:03:24] (03CR) 10Gkyziridis: [C:03+1] httpbb(liftwing): add edit-check tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [11:03:48] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2004.codfw.wmnet [11:03:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2004.codfw.wmnet [11:03:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1007.eqiad.wmnet with reason: host reimage [11:04:32] (03CR) 10Majavah: [C:03+2] P:openstack: Pass ports as numbers to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1149631 (owner: 10Majavah) [11:05:48] (03PS9) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [11:06:31] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for 
host kubestage2004.codfw.wmnet [11:06:38] (03CR) 10Ilias Sarantopoulos: httpbb(liftwing): add edit-check tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [11:06:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2004.codfw.wmnet [11:07:14] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2004.codfw.wmnet [11:07:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2004.codfw.wmnet [11:10:41] (03PS10) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [11:10:56] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:17:41] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1149632 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [11:18:32] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:23:11] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1008.eqiad.wmnet with reason: host reimage [11:26:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1008.eqiad.wmnet with reason: host reimage [11:37:17] (03PS1) 10Vgutierrez: systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) [11:37:19] (03PS1) 10Vgutierrez: systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) [11:39:22] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5672/console" [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [11:41:08] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5673/console" [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: 
10Vgutierrez) [11:47:37] !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:49:35] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [11:51:23] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [11:51:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1007.eqiad.wmnet with OS bullseye [11:51:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-be1007.eqiad.wmnet with OS bullsey... [11:51:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [11:54:08] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1009.eqiad.wmnet with OS bullseye [11:54:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-be1009.eqiad.wmnet with OS bul... 
[11:54:53] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [11:55:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [11:55:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1008.eqiad.wmnet with OS bullseye [11:55:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10851943 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-be1008.eqiad.wmnet with OS bullsey... [11:56:06] (03PS2) 10Vgutierrez: systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) [11:58:38] (03PS1) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [12:01:12] (03CR) 10CI reject: [V:04-1] varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [12:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:07] (03PS2) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [12:05:45] (03PS3) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 
10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [12:08:51] (03CR) 10CI reject: [V:04-1] varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [12:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:32] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:20:03] (03PS4) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [12:21:42] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1009.eqiad.wmnet with reason: host reimage [12:21:57] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [12:25:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1009.eqiad.wmnet with reason: host reimage [12:29:12] (03CR) 10Gkyziridis: [C:03+2] ml-services: edit-check latest image deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149632 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [12:32:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) 
- https://phabricator.wikimedia.org/T390171#10852117 (10Jclark-ctr) sorry for spam forgot to update ticket number on running reimage on another host [12:33:24] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:35:40] (03PS2) 10AOkoth: doc: swap doc1003 with doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1149469 [12:36:44] (03CR) 10AOkoth: "Ack. I think I did it in my head. 😕" [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth) [12:37:12] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:37:32] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10852125 (10Jclark-ctr) @MatthewVernon Thank you for your assistance. In most cases, I am able to image a server and have it successfully pass Pupp... [12:37:34] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[12:39:17] (03CR) 10Brouberol: [C:03+2] Enable talking to schema.discovery.wmnet via the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149638 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [12:40:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:40:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:44:57] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [12:45:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [12:45:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1009.eqiad.wmnet with OS bullseye [12:45:48] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10852139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-be1009.eqiad.wmnet with OS bullsey... [12:50:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10852153 (10MatthewVernon) >>! In T392909#10852125, @Jclark-ctr wrote: > @MatthewVernon Thank you for your assistance. In most cases, I am able to... 
[12:52:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10852155 (10Jclark-ctr) 05Open→03Resolved Thanks for your assistance [12:53:03] (03CR) 10Hashar: "I was challenging the need to support an alternate repo since apparently the sole usage could switch from `peer` to `origin` which would t" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [12:53:40] (03CR) 10Hashar: [C:03+1] git::clone: set given remote name on initial cloning [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [12:53:52] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for lsobanski - https://phabricator.wikimedia.org/T395110#10852165 (10ABran-WMF) 05Open→03Resolved a:03ABran-WMF {T395094} and {T395110} done [12:54:04] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10852171 (10ABran-WMF) 05Open→03Resolved a:03ABran-WMF {T395094} and {T395110} done [12:54:25] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [12:57:26] fnegri@cumin1002 update-views (PID 1213905) is awaiting input [13:02:23] (03PS14) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [13:02:24] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:04:04] (03CR) 10Federico Ceratto: "Updated as described" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [13:04:32] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 07Jenkins: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10852258 (10hashar) >>! In T394749#10850867, @Dzahn wrote: > @hashar So it requires 2 things, membe... 
[13:04:33] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10852259 (10MatthewVernon) [13:07:17] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [13:08:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10852293 (10Gehel) [13:08:41] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10852317 (10Gehel) [13:09:09] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10852329 (10Gehel) [13:09:19] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10852335 (10Gehel) [13:09:30] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10852341 (10Gehel) [13:09:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10852345 (10Gehel) [13:11:20] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10852377 (10Gehel) [13:11:30] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10852381 (10Gehel) [13:12:08] 06SRE, 06Data-Engineering, 
06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10852395 (10Gehel) [13:12:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10852399 (10Gehel) [13:12:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10852401 (10Gehel) [13:12:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10852405 (10Gehel) [13:12:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10852403 (10Gehel) [13:12:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10852407 (10Gehel) [13:13:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10852409 (10Gehel) [13:13:11] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10852412 (10Gehel) [13:13:57] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10852430 (10Gehel) [13:15:33] (03CR) 10Marostegui: [C:03+1] 
sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [13:16:25] fnegri@cumin1002 update-views (PID 1213905) is awaiting input [13:18:44] 06SRE, 10SRE-Access-Requests: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125 (10KCVelaga_WMF) 03NEW [13:18:46] deploying a security patch [13:21:21] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10852497 (10KCVelaga_WMF) @Ahoelzl this might need your approval as well. Please see this [[ https://wikimedia.slack.com/archives/CSV483812/p1747735432060449 | Slack discuss... [13:26:45] (03PS1) 10Slyngshede: CAS: 7.2.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 [13:26:48] (03PS1) 10Brouberol: admin/data: add kcvelaga to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1149666 (https://phabricator.wikimedia.org/T393998) [13:33:50] (03PS2) 10Brouberol: admin/data: add kcvelaga to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1149666 (https://phabricator.wikimedia.org/T393998) [13:36:51] (03CR) 10Fabfur: varnish: Deploy edge uniques experiment fetcher (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:38:51] (03CR) 10Federico Ceratto: [C:03+2] sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [13:38:56] (03CR) 10Fabfur: [C:03+1] systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:39:46] (03CR) 10Fabfur: [C:03+1] systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 
10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:40:40] (03CR) 10Alexandros Kosiaris: [C:03+1] "Overall LGTM, minus the comment by Janis." [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [13:42:15] (03CR) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:45:40] (03Merged) 10jenkins-bot: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [13:45:54] !log deployed private mitigation for T395073 [13:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:08] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [13:48:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [13:48:26] (03CR) 10Ssingh: [C:03+1] systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:49:56] (03CR) 10Ssingh: [C:03+1] systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:50:08] (03CR) 10Fabfur: varnish: Deploy edge uniques experiment fetcher (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:51:32] (03PS8) 10Tiziano Fogli: pdb_resource_exporter: add 
puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 [13:54:13] (03CR) 10Tiziano Fogli: "Functionality was tested successfully on the Pontoon environment." [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (owner: 10Tiziano Fogli) [13:54:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10852603 (10MatthewVernon) @Jclark-ctr @wiki_willy any idea when that might be or if there's anywhere else these servers could go? Once they're installed... [13:55:39] (03PS1) 10Brouberol: airflow: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149673 [13:55:39] (03PS1) 10Brouberol: blunderbuss: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149674 [13:55:39] (03PS1) 10Brouberol: spark-history: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149675 [13:55:40] (03PS1) 10Brouberol: superset: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149676 [13:55:57] (03CR) 10Fabfur: varnish: Deploy edge uniques experiment fetcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:56:30] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:56:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:59:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:00:10] (03CR) 10Btullis: [C:03+1] airflow: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149673 (owner: 10Brouberol) [14:00:20] (03CR) 10Btullis: [C:03+1] blunderbuss: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149674 (owner: 10Brouberol) [14:00:36] (03CR) 10Btullis: [C:03+1] spark-history: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149675 (owner: 10Brouberol) [14:00:46] (03CR) 10Btullis: [C:03+1] superset: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149676 (owner: 10Brouberol) [14:00:59] (03CR) 10Brouberol: [C:03+2] airflow: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149673 (owner: 10Brouberol) [14:01:02] (03CR) 10Brouberol: [C:03+2] blunderbuss: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149674 (owner: 10Brouberol) [14:01:05] (03CR) 10Brouberol: [C:03+2] spark-history: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149675 (owner: 10Brouberol) [14:01:07] (03CR) 10Brouberol: [C:03+2] superset: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149676 (owner: 10Brouberol) [14:03:02] (03Merged) 10jenkins-bot: airflow: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149673 (owner: 10Brouberol) [14:03:11] (03Merged) 10jenkins-bot: blunderbuss: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149674 (owner: 10Brouberol) [14:03:12] (03Merged) 10jenkins-bot: spark-history: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149675 (owner: 10Brouberol) 
[14:03:27] (03Merged) 10jenkins-bot: superset: update kadmin server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149676 (owner: 10Brouberol) [14:03:28] (03CR) 10Ssingh: "Looks good except for the comments. I also don't have much of an opinion on the fetcher.py file right now -- I think once you finalize it " [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [14:05:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [14:05:04] fnegri@cumin1002 update-views (PID 1268512) is awaiting input [14:05:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [14:06:44] (03PS5) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [14:06:55] (03CR) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [14:07:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [14:07:38] (03PS6) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [14:07:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [14:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:45] !log bking@cumin2002 conftool action : set/pooled=no; selector: 
name=elastic1096.eqiad.wmnet|elastic1097.eqiad.wmnet|elastic1098.eqiad.wmnet|elastic1099.eqiad.wmnet|elastic1100.eqiad.wmnet|elastic1101.eqiad.wmnet|elastic1102.eqiad.wmnet|elastic1107.eqiad.wmnet|elastic1110.eqiad.wmnet [14:10:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [14:10:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [14:11:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [14:12:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [14:13:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [14:14:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [14:14:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:15:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:15:19] (03PS9) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 [14:16:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:16:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:17:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:18:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [14:18:57] 06SRE, 06cloud-services-team, 06DC-Ops: Supporting new hardware in older debian releases - https://phabricator.wikimedia.org/T301162#10852638 (10taavi) 05Open→03Resolved I 
don't see any specific problems here that need addressing so closing in order to get this to stop lingering on our workboard. [14:19:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [14:19:38] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [14:20:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [14:22:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:23:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:23:22] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10852644 (10RobH) >>! In T387231#10851628, @tappof wrote: > Sure @RobH, nothing will catch fire because of this patch (or maybe... 
[14:24:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:24:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:25:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:26:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:26:41] (03PS1) 10Fabfur: external_cloud_vendors: temporary commented Azure fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149681 (https://phabricator.wikimedia.org/T395127) [14:27:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [14:28:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [14:29:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:30:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:30:36] (03CR) 10Ssingh: [C:03+1] varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [14:31:53] (03CR) 10Vgutierrez: [C:03+1] external_cloud_vendors: temporary commented Azure fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149681 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [14:32:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:32:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:35:00] (03PS1) 10Andrea Denisse: alerts: Change team receiving alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149682 (https://phabricator.wikimedia.org/T395117) [14:35:11] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] alerts: Change team receiving alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149682 (https://phabricator.wikimedia.org/T395117) (owner: 10Andrea Denisse) [14:35:24] (03CR) 10Fabfur: [C:03+2] external_cloud_vendors: temporary commented Azure fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149681 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [14:44:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:22] (03CR) 10Klausman: [C:03+1] profile::prometheus::k8s: drop terminated pod targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [14:46:02] (03CR) 10Klausman: [C:03+1] httpbb(liftwing): add edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1149634 (https://phabricator.wikimedia.org/T394779) (owner: 10Ilias Sarantopoulos) [14:58:25] (03PS1) 10Bking: cirrussearch: add cirrussearch row E/remove elastic row F [puppet] - 10https://gerrit.wikimedia.org/r/1149687 (https://phabricator.wikimedia.org/T388610) [14:59:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149687 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:01:44] (03PS2) 10Scott French: Profile::Mediawiki_deployment: add 'clusters' field [puppet] - 10https://gerrit.wikimedia.org/r/1148480 (https://phabricator.wikimedia.org/T388761) [15:02:00] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148480 (https://phabricator.wikimedia.org/T388761) (owner: 
10Scott French) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:29] (03PS1) 10Muehlenhoff: Record LDAP access for ericmill [puppet] - 10https://gerrit.wikimedia.org/r/1149688 [15:08:46] (03CR) 10Btullis: [C:03+1] cirrussearch: add cirrussearch row E/remove elastic row F [puppet] - 10https://gerrit.wikimedia.org/r/1149687 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:09:48] (03CR) 10Bking: [C:03+2] cirrussearch: add cirrussearch row E/remove elastic row F [puppet] - 10https://gerrit.wikimedia.org/r/1149687 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:09:58] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for ericmill [puppet] - 10https://gerrit.wikimedia.org/r/1149688 (owner: 10Muehlenhoff) [15:10:15] (03PS5) 10Scott French: profile::prometheus::k8s: drop terminated pod targets [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) [15:11:53] (03CR) 10Cathal Mooney: [C:03+1] systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [15:11:59] (03CR) 10Cathal Mooney: [C:03+1] systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [15:15:16] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.eqiad.wmnet [15:15:44] (03CR) 10Klausman: [C:03+1] profile::prometheus::k8s: drop terminated pod targets [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for 
job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:22] (03PS1) 10Vgutierrez: hiera: Use GTS staging account in acmechief-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1149692 [15:17:26] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1096 to cirrussearch1096 [15:17:38] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:17:45] (03CR) 10Scott French: "Thank you all for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [15:18:33] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:33] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:18:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149692 (owner: 10Vgutierrez) [15:19:09] (03PS1) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) [15:19:47] (03CR) 10Clément Goubert: [C:03+1] Profile::Mediawiki_deployment: add 'clusters' field [puppet] - 10https://gerrit.wikimedia.org/r/1148480 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [15:19:58] (03PS2) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 
(https://phabricator.wikimedia.org/T395127) [15:20:20] (03PS3) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) [15:21:17] (03CR) 10JHathaway: systemd::timer: Allow setting FixedRandomDelay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [15:21:27] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1096 to cirrussearch1096 - bking@cumin2002" [15:21:33] (03CR) 10CI reject: [V:04-1] external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [15:22:06] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1096 to cirrussearch1096 - bking@cumin2002" [15:22:06] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:06] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1096 on all recursors [15:22:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1096 on all recursors [15:22:11] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1096 [15:22:18] (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: add 'clusters' field [puppet] - 10https://gerrit.wikimedia.org/r/1148480 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [15:22:22] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1096 [15:23:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1096 to cirrussearch1096 [15:23:36] (03CR) 10Btullis: "I'd like to think 
that we won't need this now, so I would be tempted not to proceed with it. But feel free to convince me otherwise." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) (owner: 10Brouberol) [15:23:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1096.eqiad.wmnet with OS bullseye [15:23:41] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1096 [15:23:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1096 [15:26:34] (03CR) 10Ssingh: [C:03+1] "nit in commit message." [puppet] - 10https://gerrit.wikimedia.org/r/1149692 (owner: 10Vgutierrez) [15:31:13] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1097 to cirrussearch1097 [15:31:27] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:31:31] (03CR) 10Vgutierrez: external_cloud_vendors: fix Azure prefix fetch (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [15:32:42] (03CR) 10Ssingh: [C:03+1] hiera: Use GTS staging account in acmechief-test2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149692 (owner: 10Vgutierrez) [15:33:05] (03CR) 10Scott French: "Alright, [0] has been merged, so you should be good to update this patch to include the `clusters` field. 
That said, [1] has not yet been " [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [15:33:09] (03CR) 10Vgutierrez: [C:03+2] hiera: Use GTS staging account in acmechief-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1149692 (owner: 10Vgutierrez) [15:34:28] (03PS4) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) [15:34:42] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1097 to cirrussearch1097 - bking@cumin2002" [15:35:07] (03CR) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [15:35:52] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1097 to cirrussearch1097 - bking@cumin2002" [15:35:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:53] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1097 on all recursors [15:35:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1097 on all recursors [15:35:57] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1097 [15:36:09] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1097 [15:36:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1097 to cirrussearch1097 [15:38:11] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1096.eqiad.wmnet with reason: host reimage [15:38:19] !log bking@cumin2002 START - Cookbook 
sre.hosts.reimage for host cirrussearch1097.eqiad.wmnet with OS bullseye [15:38:23] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1097 [15:38:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1097 [15:42:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1096.eqiad.wmnet with reason: host reimage [15:42:22] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10852861 (10MatthewVernon) A few TB of quota shouldn't be a problem; how many objects per bucket are you looking at? We get better performance out of fewer larger o... [15:45:30] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10852874 (10MatthewVernon) @Ladsgroup can you let me know when one of the current batch has finished, please? Now we've done the thumbnail defrag stuff, I'd like to re-asses (for the... [15:52:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1097.eqiad.wmnet with reason: host reimage [15:53:58] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10852902 (10Seddon) @BCornwall @Eevans @Dzahn: Apologies for the delay! 
I've posted at https://wikitech.wikimedia.org/wiki/User:Seddon_(WMF)/public_keys [15:56:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1097.eqiad.wmnet with reason: host reimage [15:57:43] ACKNOWLEDGEMENT - Host cirrussearch2115 is DOWN: PING CRITICAL - Packet loss = 100% Brian_King rebooted for fw update [16:01:55] (03PS2) 10Vgutierrez: systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) [16:01:55] (03PS3) 10Vgutierrez: systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) [16:01:55] (03PS7) 10Vgutierrez: varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) [16:02:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:02:28] (03CR) 10Vgutierrez: systemd::timer: Allow setting FixedRandomDelay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [16:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:02] (03PS2) 10Effie Mouzeli: WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994) [16:04:52] (03PS3) 10Effie Mouzeli: WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 
10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994) [16:06:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1096.eqiad.wmnet with OS bullseye [16:07:10] (03CR) 10JHathaway: [C:03+1] systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [16:07:24] (03CR) 10JHathaway: [C:03+1] systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [16:07:31] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2115 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:09:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:31] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2115 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:17:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1097.eqiad.wmnet with OS bullseye [16:18:33] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:23:14] 
06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE, 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10852994 (10BTullis) Hmm. The `deployment` group brings a lot of power with it, though. I'm not sure that all of our possible Airflow developers would... [16:26:53] (03PS1) 10Tchanders: Temp accounts: Allow sysop/steward to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T393615) [16:27:52] (03PS2) 10Tchanders: Temp accounts: Allow sysop/steward to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) [16:34:03] (03CR) 10JJMC89: "Stewards have `userrights` (locally) and `userrights-interwiki` (on metawiki, where the changes would actually be done), so this should no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [16:35:21] (03CR) 10Dreamy Jazz: "+1 to this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [16:39:54] 10SRE-swift-storage, 10MediaWiki-Uploading, 07Wikimedia-production-error: UploadChunkFileException: Error storing file in '{chunkPath}': backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T395049#10853050 (10MatthewVernon) I had a look in the swift logs for the associated item (per... 
[16:43:39] (03CR) 10Dreamy Jazz: mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [16:47:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853133 (10ABran-WMF) 05Resolved→03Open @wiki_willy mentioned to me that @SDeckelmann-WMF needed to access netbox as well, I'm handing this over to @Dzahn if this is time sensitive as... [16:51:09] (03PS3) 10Tchanders: Temp accounts: Allow sysop/steward to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) [16:51:45] (03PS4) 10Tchanders: Temp accounts: Allow sysop to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) [17:00:01] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853174 (10Dzahn) a:05ABran-WMF→03Dzahn [17:02:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:02:24] (03PS5) 10Tchanders: Temp accounts: Allow sysop/steward to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) [17:02:46] (03CR) 10Tchanders: "I've removed the changes for stewards from the default." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:04:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:16] (03CR) 10JJMC89: "> As far as I can tell, it seems that they would need this assigning for metawiki, and that would allow them to grant/revoke the group at " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:06:05] (03CR) 10Dreamy Jazz: "https://meta.wikimedia.org/wiki/Special:ListGroupRights says that stewards have the `userrights` permission, which allows the user to "Edi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:06:43] (03CR) 10Dreamy Jazz: "Also to what JJMCC89 said." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:09:15] (03CR) 10Dreamy Jazz: "AFAICS, it seems that `::getGroupsChangeableBy` returns all groups if the user //locally// has the `userrights` permission even if the cha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:13:37] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853218 (10Dzahn) "netbox access" can still mean some different things. It requires membership in one of these LDAP groups: 'nda' - partial read-only access (this group is typically use... 
[17:19:58] (03PS6) 10Tchanders: Temp accounts: Allow sysop to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) [17:20:32] (03CR) 10Tchanders: "Ah, I see `userrights` allows you to change all groups. I thought it allowed you to change whichever ones were added to `$wgAddGroups`." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:21:35] (03CR) 10Dreamy Jazz: [C:03+1] Temp accounts: Allow sysop to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [17:24:19] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1098 to cirrussearch1098 [17:24:31] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:27:54] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1098 to cirrussearch1098 - bking@cumin2002" [17:28:48] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1098 to cirrussearch1098 - bking@cumin2002" [17:28:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:28:49] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1098 on all recursors [17:28:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1098 on all recursors [17:28:53] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1098 [17:29:21] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1098 [17:30:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1098 to cirrussearch1098 
[17:30:10] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853269 (10Dzahn) @SDeckelmann-WMF Hey, so.. I checked and you already have membership in the "wmf" LDAP group. So that means you should be able to login on https://netbox.wikimedia.org... [17:30:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1098.eqiad.wmnet with OS bullseye [17:30:48] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1098 [17:30:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1098 [17:32:22] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853279 (10Dzahn) 05Open→03Resolved resolving! But if there is any issue or more is needed feel free to just reopen it, or we can. (Given the US holiday on Monday and the rota... [17:32:54] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1099 to cirrussearch1099 [17:33:06] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:36:15] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1099 to cirrussearch1099 - bking@cumin2002" [17:36:55] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1099 to cirrussearch1099 - bking@cumin2002" [17:36:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:56] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1099 on all recursors [17:36:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1099 on all recursors [17:37:00] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1099 [17:37:12] 
!log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1099 [17:37:35] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853304 (10SDeckelmann-WMF) Thanks! I can definitely login to netbox, but all of the objects are locked. I'm following the SRE tutorials, so if there's maybe something I missed earlie... [17:37:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1099 to cirrussearch1099 [17:38:22] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1099.eqiad.wmnet with OS bullseye [17:38:26] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1099 [17:38:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1099 [17:50:06] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1098.eqiad.wmnet with reason: host reimage [17:53:41] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853350 (10Dzahn) Hi Selena, could you link me to the tutorial you are following? 
[17:53:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1098.eqiad.wmnet with reason: host reimage [17:57:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1099.eqiad.wmnet with reason: host reimage [17:58:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:02:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1099.eqiad.wmnet with reason: host reimage [18:21:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1099.eqiad.wmnet with OS bullseye [18:28:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1098.eqiad.wmnet with OS bullseye [18:42:59] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.123 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:44:48] (03PS1) 10Ebernhardson: Turn on glent m1 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149720 (https://phabricator.wikimedia.org/T262612) [18:51:46] bking@cumin2002 rename (PID 689782) is awaiting input [18:54:29] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1100 to cirrussearch1100 [18:54:43] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:56:39] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10853518 (10Dzahn) Hey @Jdlrobson @Jdlrobson-WMF you have a user on the deployment servers now. You should now be able to run the `scap spiderpig-otp` command on them to get the access code... [19:00:18] bking@cumin2002 rename (PID 689782) is awaiting input [19:05:43] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1100 to cirrussearch1100 - bking@cumin2002" [19:06:00] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1100 to cirrussearch1100 - bking@cumin2002" [19:06:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:06:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10853523 (10Dzahn) 05In progress→03Resolved a:03Dzahn feel free to reopen if you run into any issues, cheers! 
[19:06:01] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1100 on all recursors [19:06:04] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1100 on all recursors [19:06:04] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1100 [19:06:17] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1100 [19:06:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1100 to cirrussearch1100 [19:08:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1100.eqiad.wmnet with OS bullseye [19:08:18] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1100 [19:08:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1100 [19:09:10] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1101 to cirrussearch1101 [19:09:22] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:13:18] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1101 to cirrussearch1101 - bking@cumin2002" [19:15:22] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1101 to cirrussearch1101 - bking@cumin2002" [19:15:22] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:15:22] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1101 on all recursors [19:15:25] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1101 on all recursors [19:15:26] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1101 [19:15:46] !log bking@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1101 [19:16:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1101 to cirrussearch1101 [19:18:33] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:38] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:22:04] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1101.eqiad.wmnet with OS bullseye [19:22:09] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1101 [19:22:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1101 [19:23:42] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1100.eqiad.wmnet with reason: host reimage [19:26:37] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10853594 (10Dzahn) [19:27:02] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10853597 (10Dzahn) Thank you. 
Verified :) [19:27:14] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10853598 (10Dzahn) a:05Seddon→03None [19:27:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1100.eqiad.wmnet with reason: host reimage [19:30:35] (03PS1) 10Dzahn: admin: replace SSH key for seddon [puppet] - 10https://gerrit.wikimedia.org/r/1149736 (https://phabricator.wikimedia.org/T393579) [19:36:16] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83448MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [19:41:54] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1101.eqiad.wmnet with reason: host reimage [19:42:52] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10853674 (10Ladsgroup) Sure. In eqiad it's running and it'll take a while [19:45:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1101.eqiad.wmnet with reason: host reimage [19:46:47] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1102 to cirrussearch1102 [19:47:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:47:01] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1149544 (owner: 10Muehlenhoff) [19:50:14] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1102 to cirrussearch1102 - bking@cumin2002" [19:52:48] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1102 to cirrussearch1102 - bking@cumin2002" [19:52:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:52:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1102 on all recursors [19:52:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1102 on all recursors [19:52:52] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1102 [19:55:56] bking@cumin2002 rename (PID 718256) is awaiting input [19:57:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1100.eqiad.wmnet with OS bullseye [19:58:51] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1102 [19:59:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1102 to cirrussearch1102 [20:00:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1102.eqiad.wmnet with OS bullseye [20:00:14] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1102 [20:00:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1102 [20:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[20:07:40] bking@cumin2002 rename (PID 727000) is awaiting input [20:09:45] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1107 to cirrussearch1107 [20:09:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:10:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1101.eqiad.wmnet with OS bullseye [20:13:25] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1107 to cirrussearch1107 - bking@cumin2002" [20:14:52] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1102.eqiad.wmnet with reason: host reimage [20:15:00] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1107 to cirrussearch1107 - bking@cumin2002" [20:15:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:15:01] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1107 on all recursors [20:15:04] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1107 on all recursors [20:15:05] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1107 [20:15:15] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1107 [20:15:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1107 to cirrussearch1107 [20:16:07] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [20:17:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1107.eqiad.wmnet with OS bullseye [20:17:06] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1107 [20:17:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1107 [20:18:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1102.eqiad.wmnet with reason: host reimage [20:18:33] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:34:17] bking@cumin2002 reimage (PID 732063) is awaiting input [20:37:05] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1107.eqiad.wmnet with OS bullseye [20:37:31] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149748 [20:37:54] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1107.eqiad.wmnet on all recursors [20:37:57] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1107.eqiad.wmnet on all recursors [20:39:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:43:20] bking@cumin2002 reimage (PID 744874) is awaiting input [20:43:48] !log bking@cumin2002 START - 
Cookbook sre.hosts.reimage for host cirrussearch1107.eqiad.wmnet with OS bullseye [20:43:52] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1107 [20:43:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1107 [20:44:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1102.eqiad.wmnet with OS bullseye [20:44:41] RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:51:37] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149751 [20:59:58] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1107.eqiad.wmnet with reason: host reimage [21:03:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1107.eqiad.wmnet with reason: host reimage [21:03:49] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10853772 (10Jdlrobson-WMF) Thanks all! Looking forward to trying this out next week! 
[21:04:25] FIRING: [7x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host apus-be2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:19:40] jhancock@cumin2002 provision (PID 762713) is awaiting input [21:26:24] 06SRE, 10SRE-Access-Requests: Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167 (10Dzahn) 03NEW [21:27:09] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10853814 (10Dzahn) [21:28:10] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10853830 (10Dzahn) The contint-roots group will be used for access to new VMs created in T394819. @Corvus adding your realname is op... [21:28:46] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10853833 (10Dzahn) @Corvus Please take a look at L3 and sign it if you are comfortable with it. [21:30:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1107.eqiad.wmnet with OS bullseye [21:32:36] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE, 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10853851 (10Dzahn) This sounds like a good idea. We already have other groups like that, gerrit-deployers, zuul-deployers, research-deployers, platform... 
[21:35:39] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10853866 (10Reedy) [21:36:25] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10853867 (10thcipriani) [21:37:01] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10853869 (10thcipriani) Approved as `contint-roots` approver. [21:38:22] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10853873 (10Dzahn) 05Stalled→03In progress [21:49:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:50:50] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10853926 (10Dzahn) [21:53:13] 06SRE, 10LDAP-Access-Requests: Grant Access to ops-limited for sdeckelmann-wmf - https://phabricator.wikimedia.org/T395094#10853935 (10Dzahn) We chatted a bit about this and it now seems like this is a bug or outdated docs, because she can login but not see any objects despite being in the wmf group. I cr... 
[21:56:35] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2110*,cirrussearch2111* for T394543 - bking@cumin2002 [21:56:37] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2110*,cirrussearch2111* for T394543 - bking@cumin2002 [21:56:39] T394543: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543 [21:56:50] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10853940 (10Dzahn) [21:58:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host apus-be2004.codfw.wmnet with OS bookworm [21:58:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10853942 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host apus-be2004.codfw.wmnet with OS bookworm [21:58:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:07:39] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cirrussearch[2110-2111].codfw.wmnet with reason: firmware update [22:07:51] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10853952 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ba793dbe-d223-4f3a-8163-69d6fc192e7f) set by bking@cumin2002 for... 
[22:22:57] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10853990 (10bking) Hey Volans, I flipped the script a bit to make it a bit more readable... > Anyway, let's look at the future :) > > Today I've run some te... [22:34:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-be2004.codfw.wmnet with reason: host reimage [22:37:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be2004.codfw.wmnet with reason: host reimage [22:55:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:58:34] jhancock@cumin2002 reimage (PID 782301) is awaiting input [23:01:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:01:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be2004.codfw.wmnet with OS bookworm [23:01:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10854041 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-be2004.codfw.wmnet with OS bookworm completed: - apus-be2004 (**P... [23:02:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10854042 (10Jhancock.wm) 05Open→03Resolved [23:02:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10854046 (10Jhancock.wm) @MatthewVernon @Jclark-ctr thanks for troubleshooting on the other one. 
This one is ready to go [23:18:33] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:38] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:23:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10854075 (10wiki_willy) Hi @MatthewVernon - I just replied back to your email with a more in-depth explanation. The short answer though is that we need m... [23:38:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1149786 [23:38:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1149786 (owner: 10TrainBranchBot) [23:50:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1149786 (owner: 10TrainBranchBot)