[00:00:01] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1051.eqiad.wmnet with reason: host reimage
[00:02:45] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS bookworm
[00:02:54] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11155906 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1050.eqiad.wmnet with OS bookworm
[00:04:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1051.eqiad.wmnet with reason: host reimage
[00:04:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:06] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[00:07:59] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1185439
[00:07:59] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1185439 (owner: TrainBranchBot)
[00:09:54] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11155907 (VRiley-WMF)
[00:10:40] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1052 - vriley@cumin1003"
[00:10:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1052 - vriley@cumin1003"
[00:10:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:11:03] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1052
[00:12:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1052
[00:13:20] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:21:23] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:21:56] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:21:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1051.eqiad.wmnet with OS bookworm
[00:22:09] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11155908 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1051.eqiad.wmnet with OS bookworm completed: - es1051 (**PASS**) -...
[00:27:03] vriley@cumin1003 reimage (PID 1493635) is awaiting input
[00:30:18] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1185439 (owner: TrainBranchBot)
[00:35:41] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:43:31] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1050.eqiad.wmnet with OS bookworm
[00:43:41] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11155919 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1050.eqiad.wmnet with OS bookworm executed with errors: - es1050 (**F...
[01:00:40] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:13:22] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 42s)
[01:18:44] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1052.eqiad.wmnet with OS bookworm
[01:18:57] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11155959 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1052.eqiad.wmnet with OS bookworm
[01:22:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:32:55] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:09] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:36:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:39:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:49:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:50:01] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1052.eqiad.wmnet with reason: host reimage
[01:53:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1052.eqiad.wmnet with reason: host reimage
[02:11:35] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[02:12:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[02:12:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1052.eqiad.wmnet with OS bookworm
[02:12:51] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156000 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1052.eqiad.wmnet with OS bookworm completed: - es1052 (**PASS**) -...
[02:16:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:36:34] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:36:34] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:37:34] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:37:34] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:38:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:43:10] RESOLVED: [3x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:55:04] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156003 (VRiley-WMF)
[03:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:12:28] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[03:17:50] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[03:20:13] !log vriley@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[03:20:32] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[03:26:08] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[03:26:25] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1053 - vriley@cumin1003"
[03:26:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1053 - vriley@cumin1003"
[03:26:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:26:47] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1053
[03:28:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1053
[03:28:44] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:28:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:29:06] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1054
[03:30:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1054
[03:30:52] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:32:27] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a5-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403867#11156026 (VRiley-WMF) a:VRiley-WMF
[03:32:44] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a5-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403867#11156029 (VRiley-WMF) Open→Resolved rebalanced power
[03:35:34] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156032 (VRiley-WMF)
[03:48:10] SRE, SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11156044 (MusikAnimal)
[03:52:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:54:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:06:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:13:11] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156050 (VRiley-WMF)
[04:27:06] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1053.eqiad.wmnet with OS bookworm
[04:27:13] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156051 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1053.eqiad.wmnet with OS bookworm
[04:28:08] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156052 (VRiley-WMF)
[04:31:01] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[04:32:03] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1054.eqiad.wmnet with OS bookworm
[04:32:13] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156053 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm
[04:35:50] (PS1) KartikMistry: Update MinT to 2025-09-03-160715-production [deployment-charts] - https://gerrit.wikimedia.org/r/1185525 (https://phabricator.wikimedia.org/T400562)
[04:36:25] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1055 - vriley@cumin1003"
[04:36:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1055 - vriley@cumin1003"
[04:36:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:37:08] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1055
[04:38:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1055
[04:40:10] (CR) KartikMistry: [C:+2] Update MinT to 2025-09-03-160715-production [deployment-charts] - https://gerrit.wikimedia.org/r/1185525 (https://phabricator.wikimedia.org/T400562) (owner: KartikMistry)
[04:42:08] (Merged) jenkins-bot: Update MinT to 2025-09-03-160715-production [deployment-charts] - https://gerrit.wikimedia.org/r/1185525 (https://phabricator.wikimedia.org/T400562) (owner: KartikMistry)
[04:42:17] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:43:15] Deploying MinT. Minor change.
[04:43:48] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[04:46:51] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[04:48:13] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[04:50:30] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[04:52:19] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1056 - vriley@cumin1003"
[04:52:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1056 - vriley@cumin1003"
[04:52:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:53:14] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:54:10] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1053.eqiad.wmnet with reason: host reimage
[04:56:11] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[04:58:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1053.eqiad.wmnet with reason: host reimage
[04:59:39] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[05:02:52] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[05:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:05:26] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[05:05:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[05:06:23] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1057 - vriley@cumin1003"
[05:06:27] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1057 - vriley@cumin1003"
[05:06:27] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:06:52] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1057
[05:08:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1057
[05:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:39] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1057.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[05:15:58] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[05:16:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[05:17:04] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[05:17:05] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1053.eqiad.wmnet with OS bookworm
[05:17:17] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156062 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1053.eqiad.wmnet with OS bookworm completed: - es1053 (**PASS**) -...
[05:22:28] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1055.eqiad.wmnet with OS bookworm
[05:22:43] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156064 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1055.eqiad.wmnet with OS bookworm
[05:23:44] !log Updated MinT to 2025-09-03-160715-production (T400562)
[05:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:48] T400562: Create a unified Logstash dashboard displaying errors from cx, cxserver, RecommentationAPI, MinT - https://phabricator.wikimedia.org/T400562
[05:29:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:29:48] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156066 (VRiley-WMF)
[05:31:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1057.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[05:32:58] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1054.eqiad.wmnet with OS bookworm
[05:33:10] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156068 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm executed with errors: - es1054 (**F...
[05:33:34] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1054.eqiad.wmnet with OS bookworm
[05:33:40] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156069 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm
[05:33:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:36:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:36:59] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm
[05:37:10] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm
[05:44:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:46:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:47:40] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1057.eqiad.wmnet with OS bookworm
[05:47:49] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156075 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1057.eqiad.wmnet with OS bookworm
[05:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:49:18] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1055.eqiad.wmnet with reason: host reimage
[05:51:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.209 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1055.eqiad.wmnet with reason: host reimage
[06:01:31] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1054.eqiad.wmnet with OS bookworm
[06:01:45] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156079 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm executed with errors: - es1054 (**F...
[06:03:14] (PS5) Papaul: Add eqsin private IPV4 to prefix-list pops4 [homer/public] - https://gerrit.wikimedia.org/r/1185115
[06:12:26] (CR) Stevemunene: [C:+1] Bump the size of the java heap for the HDFS namenodes [puppet] - https://gerrit.wikimedia.org/r/1185082 (https://phabricator.wikimedia.org/T342587) (owner: Btullis)
[06:13:49] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[06:15:01] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1057.eqiad.wmnet with reason: host reimage
[06:15:24] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 264927
[06:15:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 264927
[06:16:34] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[06:16:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1055.eqiad.wmnet with OS bookworm
[06:16:46] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11156081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1055.eqiad.wmnet with OS bookworm completed: - es1055 (**PASS**) -...
[06:18:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1057.eqiad.wmnet with reason: host reimage
[06:31:35] vriley@cumin1003 reimage (PID 1533737) is awaiting input
[06:31:44] (CR) Muehlenhoff: [C:+1] "LGTM. To be on the safe side: When rolling it out, make sure to disable Puppet on all C:bird nodes and then start off with a low impact on" [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:33:32] !log hashar@deploy1003 Started deploy [integration/docroot@9830ef2]: Changes to CoveragePage to prepare the phase out of "skin" terminology - T402398
[06:33:35] T402398: Phase out "skin" terminology from CI config, use "extension" instead - https://phabricator.wikimedia.org/T402398
[06:33:45] !log hashar@deploy1003 Finished deploy [integration/docroot@9830ef2]: Changes to CoveragePage to prepare the phase out of "skin" terminology - T402398 (duration: 00m 13s)
[06:34:43] (PS2) Muehlenhoff: Add ncredir3006 [puppet] - https://gerrit.wikimedia.org/r/1185107 (https://phabricator.wikimedia.org/T402259)
[06:34:55] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[06:37:59] vriley@cumin1003 reimage (PID 1534252) is awaiting input
[06:38:32] (CR) DCausse: cirrus: Reduce galleries weight in search on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[06:40:24] (CR) Muehlenhoff: [C:+2] Add ncredir3006 [puppet] - https://gerrit.wikimedia.org/r/1185107 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[06:46:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:47:26] (CR) DCausse: cirrus: Reduce galleries weight in search on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[06:48:35] (PS1) Stevemunene: Add prometheus hosts to scrape dse-k8s-codfw cluster [puppet] - https://gerrit.wikimedia.org/r/1185703 (https://phabricator.wikimedia.org/T397301)
[06:48:37] (PS1) Stevemunene: grafana: Add dse-k8s-codfw prometheus data source [puppet] - https://gerrit.wikimedia.org/r/1185704 (https://phabricator.wikimedia.org/T397301)
[06:51:19] (PS1) Muehlenhoff: Remove access for frankie [puppet] - https://gerrit.wikimedia.org/r/1185705
[06:51:36] (PS2) Muehlenhoff: Remove access for frankie [puppet] - https://gerrit.wikimedia.org/r/1185705
[06:56:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.744 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:59:45] (CR) Muehlenhoff: [C:+2] Remove access for frankie [puppet] - https://gerrit.wikimedia.org/r/1185705 (owner: Muehlenhoff)
[07:00:04] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T0700).
[07:00:04] Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:08:43] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir3006.esams.wmnet
[07:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:09:06] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir3006.esams.wmnet
[07:09:27] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3004.esams.wmnet
[07:09:30] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Fgoodwin out of all services on: 2421 hosts
[07:09:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:10:54] (CR) Elukey: [C:+2] services: exclude postgres masters from confs in tegola/kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1185049 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[07:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:35] (PS1) Muehlenhoff: Remove ncredir3004 [puppet] - https://gerrit.wikimedia.org/r/1185709 (https://phabricator.wikimedia.org/T402259)
[07:11:39] (Abandoned) Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[07:12:28] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[07:13:47] (CR) Muehlenhoff: [C:+1] "LGTM" [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1184388 (https://phabricator.wikimedia.org/T398600) (owner: Elukey)
[07:14:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:16:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.815 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:17:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow3004.esams.wmnet
[07:17:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:19:39] (PS1) Muehlenhoff: Apply netinsights role to netflow3004 [puppet] - https://gerrit.wikimedia.org/r/1185710 (https://phabricator.wikimedia.org/T402259)
[07:22:27] (PS2) Muehlenhoff: Apply netinsights role to netflow3004 [puppet] - https://gerrit.wikimedia.org/r/1185710 (https://phabricator.wikimedia.org/T402259)
[07:22:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow3004.esams.wmnet - jmm@cumin2002"
[07:24:03] (CR) Elukey: [V:+2 C:+2] Release upstream version 1.31.0.8 [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1184388 (https://phabricator.wikimedia.org/T398600) (owner: Elukey)
[07:25:09] (CR) Ayounsi: [C:+1] Apply netinsights role to netflow3004 [puppet] - https://gerrit.wikimedia.org/r/1185710 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[07:25:51] jmm@cumin2002 makevm (PID 558602) is awaiting input
[07:33:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow3004.esams.wmnet - jmm@cumin2002"
[07:33:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:33:05] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow3004.esams.wmnet on all recursors
[07:33:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow3004.esams.wmnet on all recursors
[07:33:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow3004.esams.wmnet - jmm@cumin2002"
[07:33:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow3004.esams.wmnet - jmm@cumin2002"
[07:33:48] (CR) Filippo Giunchedi: "I'm no longer part of o11y team, adding o11y folks" [puppet] - https://gerrit.wikimedia.org/r/1185704 (https://phabricator.wikimedia.org/T397301) (owner: Stevemunene)
[07:33:54] SRE, SRE-Access-Requests, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11156268 (JMeybohm) a:JMeybohm→None Removing assignment due to end of clinic duty shift
[07:34:05] (CR) Muehlenhoff: [C:+2] imposm: Drop quiet from start flags [puppet] - https://gerrit.wikimedia.org/r/1185104 (owner: Muehlenhoff)
[07:36:52] jmm@cumin2002 makevm (PID 558602) is awaiting input
[07:37:31] (CR) Filippo Giunchedi: [V:+1] "Ack, will do" [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:37:34] (CR) Filippo Giunchedi: [V:+1 C:+2] bird: use LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:37:39] (PS3) Filippo Giunchedi: bird: use LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899)
[07:38:33] (CR) Filippo Giunchedi: [V:+2 C:+2] bird: use LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184793
(https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [07:40:05] (03CR) 10Ayounsi: [C:03+1] "lgtm, but yeah, please do a careful rollout :)" [puppet] - 10https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [07:42:21] (03PS1) 10STran: Enable temporary accounts on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185838 (https://phabricator.wikimedia.org/T402181) [07:42:44] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1185059 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:42:50] (03PS2) 10Stevemunene: druid: Bring druid1012.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182698 (https://phabricator.wikimedia.org/T397441) [07:42:50] (03PS2) 10Stevemunene: druid: Bring druid1013.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182699 (https://phabricator.wikimedia.org/T397441) [07:42:50] (03PS2) 10Stevemunene: druid: Add druid druid101[2-3] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1182700 (https://phabricator.wikimedia.org/T397441) [07:42:50] (03PS1) 10Stevemunene: druid: Change druid host used to run refinery data puge job [puppet] - 10https://gerrit.wikimedia.org/r/1185839 [07:42:51] (03PS1) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) [07:43:26] XioNoX: yes indeed re: rollout, I just applied on durum6001 while other bird hosts have puppet disabled, what's the easiest way to check bird is working as expected ? 
[07:44:41] godog: `durum6001:~$ sudo birdc show protocol bfd1` [07:44:47] it's up so it's good [07:44:56] bgp is established as wel [07:44:59] l [07:45:31] ok sweet, thank you [07:45:45] I'll do centrallog as recommended by moritzm [07:45:48] cool [07:46:13] !log upgrading Envoy on an-web, an-tool1007 (turnilo), an-tool1008 (yarn) T402584 [07:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:18] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [07:48:26] XioNoX: ok done centrallog2002, if the firewall was blocking bird traffic how long would it take bgp and bfd to notice on the bird side ? [07:49:04] bfd about 3x300ms [07:49:16] and bfd would take down bgp [07:49:30] ok! safe to say we would have already noticed [07:49:39] yep :) [07:50:07] ok I'll re-enable puppet on a subset of hosts and check those, so far so good [07:50:22] great, and thanks for the change it's much better [07:51:02] (03PS1) 10Brouberol: clouddumps: allow DSE pods to acess the https port [puppet] - 10https://gerrit.wikimedia.org/r/1185843 (https://phabricator.wikimedia.org/T402784) [07:51:18] you're welcome, it all started with me botching the first version of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184792 [07:51:24] and then doing it properly the next time [07:51:51] (03PS7) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1184036 (https://phabricator.wikimedia.org/T402611) [07:51:51] (03CR) 10Arnaudb: [C:03+2] "I'll preshot the revert" [puppet] - 10https://gerrit.wikimedia.org/r/1184036 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [07:52:29] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185844 [07:53:02] (03PS2) 10Brouberol: clouddumps: allow DSE pods to acess the https port [puppet] - 10https://gerrit.wikimedia.org/r/1185843 (https://phabricator.wikimedia.org/T402784) [07:53:52] (03CR) 
10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6873/co" [puppet] - 10https://gerrit.wikimedia.org/r/1185843 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [07:54:08] (03PS1) 10STran: Enable temporary accounts on all medium-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) [07:54:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow3004.esams.wmnet with OS bookworm [07:54:51] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185846 (https://phabricator.wikimedia.org/T403838) [07:55:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11156321 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow3004.esams.wmnet with OS bookworm [07:56:12] !log finished rollout of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184793 - puppet re-enabled on C:bird [07:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:16] \o/ [07:57:55] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185846 (https://phabricator.wikimedia.org/T403838) [07:59:19] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185844 (owner: 10Arnaudb) [07:59:37] (03CR) 10DCausse: [C:03+1] "thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185846 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:00:24] (03CR) 10Kosta Harlan: [C:03+1] Enable temporary accounts on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185838 (https://phabricator.wikimedia.org/T402181) (owner: 10STran) [08:00:58] (03PS1) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185847 [08:01:14] (03PS2) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185847 [08:04:42] (03CR) 10Brouberol: [C:03+2] flink-kubernetes-operator: upgrade to 1.12.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185846 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:04:45] (03CR) 10Brouberol: [V:03+2 C:03+2] flink-kubernetes-operator: upgrade to 1.12.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185846 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:06:01] (03PS2) 10STran: Enable temporary accounts on all medium-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185845 (https://phabricator.wikimedia.org/T403399) [08:09:39] (03PS4) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185847 [08:09:39] (03CR) 10Arnaudb: [C:03+2] "identified bugs fixed, will preshot sanity revert" [puppet] - 10https://gerrit.wikimedia.org/r/1185847 (owner: 10Arnaudb) [08:10:13] (03PS1) 10Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185848 [08:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:12:16] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185849 
(https://phabricator.wikimedia.org/T403838) [08:12:18] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185850 (https://phabricator.wikimedia.org/T403838) [08:12:20] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185851 (https://phabricator.wikimedia.org/T403838) [08:12:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2003.codfw.wmnet [08:12:22] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185852 (https://phabricator.wikimedia.org/T403838) [08:12:24] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185853 (https://phabricator.wikimedia.org/T403838) [08:12:26] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185854 (https://phabricator.wikimedia.org/T403838) [08:12:45] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185850 (https://phabricator.wikimedia.org/T403838) [08:12:45] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185849 (https://phabricator.wikimedia.org/T403838) [08:12:45] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185852 (https://phabricator.wikimedia.org/T403838) [08:12:45] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185851 (https://phabricator.wikimedia.org/T403838) [08:12:46] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 in 
dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185853 (https://phabricator.wikimedia.org/T403838) [08:12:47] (03PS2) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185854 (https://phabricator.wikimedia.org/T403838) [08:13:09] (03CR) 10DCausse: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185850 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:13:16] !log klausman@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:13:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow3004.esams.wmnet with reason: host reimage [08:15:38] (03CR) 10DCausse: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185849 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:16:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2003.codfw.wmnet [08:16:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.961 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:16:49] PROBLEM - Host ml-serve1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:17:41] (03CR) 10Brouberol: [C:03+1] "That looks correct to me, but I'll let o11y members weight in" [puppet] - 10https://gerrit.wikimedia.org/r/1185703 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [08:18:10] (03CR) 10Brouberol: [C:03+1] "LGTM, but I'll let o11y members weigh in" [puppet] - 10https://gerrit.wikimedia.org/r/1185704 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [08:19:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
netflow3004.esams.wmnet with reason: host reimage [08:19:59] (03CR) 10CI reject: [V:04-1] flink-kubernetes-operator: upgrade to 1.12.1 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185850 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:20:44] (03CR) 10Brouberol: [C:03+2] flink-kubernetes-operator: upgrade to 1.12.1 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185850 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:20:50] FIRING: KubernetesCalicoDown: ml-serve1008.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1008.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:22:07] !log brouberol@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:22:49] 06SRE, 06Infrastructure-Foundations, 10netops: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936 (10cmooney) 03NEW p:05Triage→03Low [08:22:59] 06SRE, 06Infrastructure-Foundations, 10netops: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936#11156395 (10cmooney) [08:23:01] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#11156396 (10cmooney) [08:23:20] (03CR) 10Fabfur: [C:03+1] "That should've been already migrated to `profile::cache::haproxy::allowed_methods`` AFAIK" [puppet] - 10https://gerrit.wikimedia.org/r/1183274 (https://phabricator.wikimedia.org/T392073) (owner: 10Krinkle) [08:23:45] !log brouberol@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[08:24:22] RECOVERY - Host ml-serve1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [08:24:57] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:25:10] (03PS8) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:25:50] RESOLVED: KubernetesCalicoDown: ml-serve1008.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1008.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:27:03] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:27:18] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync [08:27:34] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [08:28:07] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:28:21] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:28:22] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [08:28:49] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [08:28:50] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11156411 (10klausman) When trying to run the cookbook against ml-serve1008, I got this: ` ==> Are you sure to proceed to apply BIOS/iDRAC 
settings for host ml-serve1008.mgmt.eqiad.wmnet with ch... [08:31:16] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11156415 (10klausman) When trying to run the cookbook against ml-serve1008, I got this: ` ==> Are you sure to proceed to apply BIOS/iDRAC settings for host ml-serve1008.mgmt.eq... [08:32:07] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [08:32:38] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [08:34:01] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:34:13] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:34:29] !log klausman@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:34:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow3004.esams.wmnet with OS bookworm [08:34:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow3004.esams.wmnet [08:35:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11156445 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow3004.esams.wmnet with OS bookworm completed: - netflo... 
[08:35:11] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [08:35:29] !log klausman@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:35:34] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [08:38:23] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [08:38:23] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:38:40] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:38:52] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:39:03] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:39:25] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [08:39:55] (03PS1) 10Ayounsi: esams: update netflow3003 to 3004 [homer/public] - 10https://gerrit.wikimedia.org/r/1185859 (https://phabricator.wikimedia.org/T402259) [08:41:12] (03CR) 10Btullis: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185849 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:41:17] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:41:21] (03CR) 10Btullis: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185852 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:41:29] (03CR) 10Btullis: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 in codfw [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1185851 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:41:30] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [08:41:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:41:39] (03CR) 10Btullis: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185853 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:41:41] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:41:44] (03CR) 10Btullis: [C:03+1] flink-kubernetes-operator: upgrade to 1.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185854 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [08:42:22] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:42:33] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [08:44:48] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:44:59] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:45:10] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:45:18] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:46:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:47:46] (03PS1) 10KartikMistry: Update cxserver to 2025-09-08-084009-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185861 (https://phabricator.wikimedia.org/T403730) [08:47:55] !log dcausse@deploy1003 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [08:48:06] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:50:21] !log klausman@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:52:36] (03CR) 10Muehlenhoff: [C:03+2] Apply netinsights role to netflow3004 [puppet] - 10https://gerrit.wikimedia.org/r/1185710 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [08:53:27] klausman@cumin1003 provision (PID 1556977) is awaiting input [08:54:26] !log klausman@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:54:35] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [08:54:49] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:58:47] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [08:59:04] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:03:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install3004.wikimedia.org [09:03:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:03:20] !log klausman@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:03:42] !log klausman@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:04:01] Minor cxserver deployment.. 
[09:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:05:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2004.codfw.wmnet [09:07:02] !log klausman@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ml-lab1002.eqiad.wmnet with reason: Maintenance work for T401964 [09:07:05] T401964: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964 [09:07:26] !log klausman@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:07:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:07:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2004.codfw.wmnet [09:07:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82674 and previous config saved to /var/cache/conftool/dbconfig/20250908-090734-fceratto.json [09:07:37] !log klausman@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:07:38] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:08:04] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:08:38] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:09:12] !log jmm@cumin2002 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3004.wikimedia.org - jmm@cumin2002" [09:09:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install3004.wikimedia.org - jmm@cumin2002" [09:09:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:09:33] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install3004.wikimedia.org on all recursors [09:09:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install3004.wikimedia.org on all recursors [09:09:49] !log klausman@cumin1003 START - Cookbook sre.hosts.remove-downtime for ml-lab1002.eqiad.wmnet [09:09:49] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-lab1002.eqiad.wmnet [09:09:49] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [09:09:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82675 and previous config saved to /var/cache/conftool/dbconfig/20250908-090958-fceratto.json [09:10:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install3004.wikimedia.org - jmm@cumin2002" [09:10:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install3004.wikimedia.org - jmm@cumin2002" [09:10:23] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [09:11:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1005.eqiad.wmnet [09:11:29] !log Updated cxserver to 2025-09-08-084009-production (T403730) [09:11:32] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:33] T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730 [09:11:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:13:13] jmm@cumin2002 makevm (PID 613442) is awaiting input [09:15:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1005.eqiad.wmnet [09:16:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:17:23] FIRING: GnmiTargetDown: asw1-by27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:18:35] !log dropping all objectcache table everywhere (T397367) [09:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:39] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [09:23:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:23:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1163 (T402925)', diff saved to https://phabricator.wikimedia.org/P82676 and previous config saved to /var/cache/conftool/dbconfig/20250908-092311-ladsgroup.json [09:23:15] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [09:25:06] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P82677 and previous config saved to /var/cache/conftool/dbconfig/20250908-092506-fceratto.json [09:27:23] RESOLVED: GnmiTargetDown: asw1-by27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:33:25] (03Merged) 10jenkins-bot: flink-kubernetes-operator: upgrade to 1.12.1 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185849 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [09:33:53] (03Merged) 10jenkins-bot: flink-kubernetes-operator: upgrade to 1.12.1 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185852 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [09:33:55] (03Merged) 10jenkins-bot: flink-kubernetes-operator: upgrade to 1.12.1 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185851 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [09:33:57] (03Merged) 10jenkins-bot: flink-kubernetes-operator: upgrade to 1.12.1 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185853 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [09:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:34:00] (03Merged) 10jenkins-bot: flink-kubernetes-operator: upgrade to 1.12.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185854 
(https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [09:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:40:12] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:40:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P82678 and previous config saved to /var/cache/conftool/dbconfig/20250908-094013-fceratto.json [09:46:48] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [09:46:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum3006.esams.wmnet [09:47:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1003.eqiad.wmnet [09:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:50:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum3006.esams.wmnet [09:50:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1003.eqiad.wmnet [09:51:48] (03CR) 10Muehlenhoff: [C:03+2] Make durum3006 a durum node [puppet] - 10https://gerrit.wikimedia.org/r/1185094 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [09:51:55] !log brouberol@deploy1003 helmfile 
[staging-codfw] DONE helmfile.d/admin 'apply'. [09:52:42] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet [09:52:57] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:53:12] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 949887320 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:54:02] (03CR) 10David Caro: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [09:54:06] !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:54:12] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:54:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1012.eqiad.wmnet [09:55:10] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts netflow3003.esams.wmnet [09:55:21] (03PS1) 10Brouberol: flink-kubernetes-operator: upgrade to 1.12.1 (fix typo) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185872 (https://phabricator.wikimedia.org/T403838) [09:55:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82679 and previous config saved to /var/cache/conftool/dbconfig/20250908-095521-fceratto.json [09:55:27] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:55:28] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts netflow3003.esams.wmnet [09:55:37] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:55:45] (03CR) 10Filippo Giunchedi: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [09:55:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:56:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T401906)', diff saved to https://phabricator.wikimedia.org/P82680 and previous config saved to /var/cache/conftool/dbconfig/20250908-095602-fceratto.json [09:56:27] (03PS1) 10Ayounsi: Kafka: remove netflow3003 ACL before decom [puppet] - 10https://gerrit.wikimedia.org/r/1185873 (https://phabricator.wikimedia.org/T402259) [09:57:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1185873 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [09:57:55] (03CR) 10Ayounsi: [C:03+2] Kafka: remove netflow3003 ACL before decom [puppet] - 10https://gerrit.wikimedia.org/r/1185873 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [09:58:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T401906)', diff saved to https://phabricator.wikimedia.org/P82681 and previous config saved to /var/cache/conftool/dbconfig/20250908-095826-fceratto.json [09:59:24] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts netflow3003.esams.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1000) [10:01:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T402925)', diff saved to https://phabricator.wikimedia.org/P82682 and previous config saved to 
/var/cache/conftool/dbconfig/20250908-100107-ladsgroup.json [10:01:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1012.eqiad.wmnet [10:01:18] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:01:44] (03PS1) 10Btullis: Fix the partman recipe for dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1185874 (https://phabricator.wikimedia.org/T399779) [10:03:10] (03CR) 10Muehlenhoff: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [10:04:02] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [10:04:02] (03CR) 10Vgutierrez: [C:03+1] team-traffic: removed haproxykafka critical alert [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [10:04:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11156849 (10ayounsi) [10:05:19] (03CR) 10Vgutierrez: [C:03+2] haproxy: Add an Allow header on 405 responses [puppet] - 10https://gerrit.wikimedia.org/r/1184887 (https://phabricator.wikimedia.org/T403767) (owner: 10Vgutierrez) [10:06:26] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11156853 (10BTullis) >>! In T399779#11149794, @Jclark-ctr wrote: > @bking @BTullis Can you assist with preseed.yaml? It doesn’t appear to be configured for EFI booting on... 
[10:06:52] (03PS1) 10Muehlenhoff: Also enable new Bird for durum3006 [puppet] - 10https://gerrit.wikimedia.org/r/1185876 (https://phabricator.wikimedia.org/T402259) [10:07:47] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [10:08:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [10:08:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:08:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow3003.esams.wmnet [10:08:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11156860 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `netflow3003.esams.wmnet` - netflow3003.esams.wmnet (**PA... 
[10:08:52] (03CR) 10Muehlenhoff: [C:03+2] Also enable new Bird for durum3006 [puppet] - 10https://gerrit.wikimedia.org/r/1185876 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [10:08:57] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:36] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11156879 (10Jclark-ctr) I was going off the server GUI for Bmc it list the drives as nvme but will double check physically when I get in today [10:10:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [10:10:57] (03CR) 10Cathal Mooney: [C:03+2] WMF-Plugin: Include the BGP role when exposing the IGBP data (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [10:11:09] (03PS2) 10Muehlenhoff: Make doh3006 a wikidough node [puppet] - 10https://gerrit.wikimedia.org/r/1185047 (https://phabricator.wikimedia.org/T402259) [10:11:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install3004.wikimedia.org with OS bookworm [10:11:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11156882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install3004.wikimedia.org with OS bookworm [10:13:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P82683 and previous config saved to /var/cache/conftool/dbconfig/20250908-101334-fceratto.json [10:13:57] RESOLVED: [3x] JobUnavailable: 
Reduced availability for job fastnetmon in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:05] (03PS3) 10Muehlenhoff: Make doh3006 a wikidough node [puppet] - 10https://gerrit.wikimedia.org/r/1185047 (https://phabricator.wikimedia.org/T402259) [10:16:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P82684 and previous config saved to /var/cache/conftool/dbconfig/20250908-101614-ladsgroup.json [10:16:49] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Update wmf-plugin IBGP output - cmooney@cumin1003 [10:17:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [10:18:41] (03Merged) 10jenkins-bot: JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [10:19:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Update wmf-plugin IBGP output - cmooney@cumin1003 [10:20:31] (03Abandoned) 10Ladsgroup: Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (owner: 10Ladsgroup) [10:22:30] (03CR) 10Btullis: [C:03+2] Remove references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [10:23:29] (03CR) 10Btullis: [V:03+1 C:03+2] Bump the size of the java heap for the HDFS namenodes [puppet] - 10https://gerrit.wikimedia.org/r/1185082 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis) [10:25:39] (03CR) 
10Ayounsi: [C:03+1] "fyi, the BGP flag in Netbox needs to be set (and homer run) for the host to start receiving traffic." [puppet] - 10https://gerrit.wikimedia.org/r/1185047 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [10:27:04] (03PS1) 10Elukey: redfish: support weak Etag values in change_user_password [software/spicerack] - 10https://gerrit.wikimedia.org/r/1185877 (https://phabricator.wikimedia.org/T392851) [10:28:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P82685 and previous config saved to /var/cache/conftool/dbconfig/20250908-102842-fceratto.json [10:29:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:30:54] (03CR) 10Volans: [C:03+1] "LGTM, minor nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1185877 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [10:31:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P82686 and previous config saved to /var/cache/conftool/dbconfig/20250908-103122-ladsgroup.json [10:31:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:33:18] (03PS2) 10Elukey: redfish: support weak Etag values in change_user_password [software/spicerack] - 10https://gerrit.wikimedia.org/r/1185877 (https://phabricator.wikimedia.org/T392851) [10:33:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2009.codfw.wmnet [10:33:26] (03CR) 10Elukey: redfish: support weak Etag 
values in change_user_password (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1185877 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [10:35:26] (03PS1) 10Federico Ceratto: es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) [10:35:26] (03CR) 10Federico Ceratto: "Enable notifications before putting the host in production" [puppet] - 10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:36:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11156961 (10elukey) This patch https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1185877 should solve the last issue with Redfish, but it require... [10:37:13] RESOLVED: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [10:38:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2009.codfw.wmnet [10:43:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T401906)', diff saved to https://phabricator.wikimedia.org/P82687 and previous config saved to /var/cache/conftool/dbconfig/20250908-104350-fceratto.json [10:43:54] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:44:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [10:44:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T401906)', diff saved to https://phabricator.wikimedia.org/P82688 and previous 
config saved to /var/cache/conftool/dbconfig/20250908-104413-fceratto.json [10:45:45] (03PS1) 10Aklapper: phabricator: remove defunct ElasticSearch backend settings [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) [10:46:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T402925)', diff saved to https://phabricator.wikimedia.org/P82689 and previous config saved to /var/cache/conftool/dbconfig/20250908-104629-ladsgroup.json [10:46:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:46:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:46:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T402925)', diff saved to https://phabricator.wikimedia.org/P82690 and previous config saved to /var/cache/conftool/dbconfig/20250908-104652-ladsgroup.json [10:48:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install3004.wikimedia.org with reason: host reimage [10:48:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host doh3006.wikimedia.org [10:49:32] (03CR) 10Brouberol: [C:03+1] Fix the partman recipe for dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1185874 (https://phabricator.wikimedia.org/T399779) (owner: 10Btullis) [10:49:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:50] (03CR) 10Brouberol: [C:03+2] flink-kubernetes-operator: upgrade to 1.12.1 (fix typo) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185872 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [10:52:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doh3006.wikimedia.org [10:53:44] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install3004.wikimedia.org with reason: host reimage [10:54:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.362 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:57:39] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:58:44] !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:01:24] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:01:43] (03PS1) 10Brouberol: dse-k8s-eqiad/mediawiki-dumps-legacy: set the max memory to ~64GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185886 (https://phabricator.wikimedia.org/T403947) [11:02:20] (03CR) 10Btullis: [C:03+2] Fix the partman recipe for dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1185874 (https://phabricator.wikimedia.org/T399779) (owner: 10Btullis) [11:02:50] (03CR) 10Muehlenhoff: [C:03+2] Make doh3006 a wikidough node [puppet] - 10https://gerrit.wikimedia.org/r/1185047 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:02:56] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad/mediawiki-dumps-legacy: set the max memory to ~64GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185886 (https://phabricator.wikimedia.org/T403947) (owner: 10Brouberol) [11:03:22] !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:04:00] !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:05:45] !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:06:58] (03CR) 10Clément Goubert: [C:03+2] "Done (cleanup)" [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T394423) (owner: 10Hnowlan) [11:07:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[11:07:24] (03CR) 10Clément Goubert: [C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [11:07:39] (03CR) 10Clément Goubert: [C:03+2] "cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [11:07:58] (03CR) 10Clément Goubert: [C:03+2] "cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:08:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:08:09] (03CR) 10Clément Goubert: [C:03+2] "cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [11:08:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install3004.wikimedia.org with OS bookworm [11:08:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install3004.wikimedia.org [11:08:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install3004.wikimedia.org with OS bookworm completed: - inst... 
[11:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:10:08] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad/mediawiki-dumps-legacy: set the max memory to ~64GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185886 (https://phabricator.wikimedia.org/T403947) (owner: 10Brouberol) [11:10:51] (03CR) 10Majavah: wmcs: port ::instance to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [11:11:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:11:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:13:19] (03CR) 10Btullis: [C:03+1] clouddumps: allow DSE pods to acess the https port [puppet] - 10https://gerrit.wikimedia.org/r/1185843 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [11:14:08] (03CR) 10Stevemunene: [C:03+1] clouddumps: allow DSE pods to acess the https port [puppet] - 10https://gerrit.wikimedia.org/r/1185843 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [11:14:24] (03CR) 10Brouberol: [V:03+1 C:03+2] clouddumps: allow DSE pods to acess the https port [puppet] - 10https://gerrit.wikimedia.org/r/1185843 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [11:14:40] (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir3004 [puppet] - 10https://gerrit.wikimedia.org/r/1185709 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:19:56] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3004.esams.wmnet [11:20:12] (03CR) 10Arnaudb: [C:03+2] "typo in class mtail definition triggering a bug" [puppet] - 10https://gerrit.wikimedia.org/r/1185848 (owner: 10Arnaudb) [11:21:25] (03PS1) 10Arnaudb: 
Revert^4 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185888 [11:24:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T402925)', diff saved to https://phabricator.wikimedia.org/P82691 and previous config saved to /var/cache/conftool/dbconfig/20250908-112414-ladsgroup.json [11:24:18] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:24:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:30:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3004.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:31:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3004.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:31:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:31:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir3004.esams.wmnet [11:31:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157083 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3004.esams.wmnet` - ncredir3004.esams.wmnet (**PASS**... 
[11:35:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum3005.esams.wmnet to drbd [11:36:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157092 (10ops-monitoring-bot) VM durum3005.esams.wmnet switching disk type to drbd [11:39:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P82692 and previous config saved to /var/cache/conftool/dbconfig/20250908-113922-ladsgroup.json [11:40:06] (03CR) 10Btullis: [C:03+2] Enable prometheus support for dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185074 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [11:42:23] (03PS1) 10Brouberol: global_config: define an external service for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1185894 (https://phabricator.wikimedia.org/T402784) [11:42:59] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1233.eqiad.wmnet with OS bullseye [11:43:22] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1233.eqiad.wmnet with OS bullseye [11:44:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T401906)', diff saved to https://phabricator.wikimedia.org/P82693 and previous config saved to /var/cache/conftool/dbconfig/20250908-114429-fceratto.json [11:44:33] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:44:36] !log restart netbox service on netbox-dev2003 (netbox-next) to update db from live server dump [11:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:00] (03CR) 10Federico Ceratto: "There are no active alarms, and `sudo journalctl -p0..2` shows only:" [puppet] - 
10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:45:02] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6875/co" [puppet] - 10https://gerrit.wikimedia.org/r/1185894 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [11:45:04] !log btullis@cumin1003 START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [11:45:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum3005.esams.wmnet to drbd [11:45:57] PROBLEM - Host durum3005 is DOWN: PING CRITICAL - Packet loss = 100% [11:46:55] RECOVERY - Host durum3005 is UP: PING OK - Packet loss = 0%, RTA = 80.55 ms [11:47:25] PROBLEM - Bird Internet Routing Daemon on durum3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:48:25] RECOVERY - Bird Internet Routing Daemon on durum3005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:49:33] (03PS1) 10Brouberol: airflow-test-k8s: authorize task pods to reach out to the dumps public site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185897 (https://phabricator.wikimedia.org/T402784) [11:49:41] !log btullis@cumin1003 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [11:50:31] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (035 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:50:57] (03PS9) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 
(https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:54:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P82694 and previous config saved to /var/cache/conftool/dbconfig/20250908-115429-ladsgroup.json [11:54:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:55:25] !log Upgrading trixie installer image to 13.1 T403815 [11:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:28] T403815: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815 [11:56:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:56:59] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11157121 (10MoritzMuehlenhoff) [11:57:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [11:58:11] (03PS10) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:59:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P82695 and previous config saved to /var/cache/conftool/dbconfig/20250908-115937-fceratto.json [12:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.064s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:48] (03PS1) 10Muehlenhoff: Apply installserver role to install3004 [puppet] - 10https://gerrit.wikimedia.org/r/1185898 (https://phabricator.wikimedia.org/T402259) [12:04:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182541 (https://phabricator.wikimedia.org/T376049) (owner: 10Anzx) [12:09:01] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11157146 (10MoritzMuehlenhoff) [12:09:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T402925)', diff saved to https://phabricator.wikimedia.org/P82696 and previous config saved to /var/cache/conftool/dbconfig/20250908-120937-ladsgroup.json [12:09:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.954 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:42] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:09:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [12:10:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T402925)', diff saved to https://phabricator.wikimedia.org/P82697 and previous config saved to /var/cache/conftool/dbconfig/20250908-121000-ladsgroup.json [12:11:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.388 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:44] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:12:55] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:14:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P82698 and previous config saved to /var/cache/conftool/dbconfig/20250908-121444-fceratto.json [12:15:06] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:16:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [12:27:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.01s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:59] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11157187 (10mahmoud.abdelsattar.wmde) Hello @KFrancis :) Could you please confirm the signing of the NDA agreemen... [12:29:04] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11157188 (10KSiebert) I am approving. 
[12:29:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T401906)', diff saved to https://phabricator.wikimedia.org/P82699 and previous config saved to /var/cache/conftool/dbconfig/20250908-122952-fceratto.json [12:30:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:30:04] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:30:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T401906)', diff saved to https://phabricator.wikimedia.org/P82700 and previous config saved to /var/cache/conftool/dbconfig/20250908-123007-fceratto.json [12:30:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh3005.wikimedia.org to drbd [12:30:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157193 (10ops-monitoring-bot) VM doh3005.wikimedia.org switching disk type to drbd [12:32:02] (03CR) 10Btullis: [C:03+1] global_config: define an external service for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1185894 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [12:32:04] (03PS11) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:32:16] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: authorize task pods to reach out to the dumps public site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185897 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [12:32:25] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: define an external service for dumps.wikimedia.org [puppet] - 
10https://gerrit.wikimedia.org/r/1185894 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [12:32:26] (03PS3) 10Stevemunene: druid: Bring druid1012.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182698 (https://phabricator.wikimedia.org/T397441) [12:32:26] (03PS3) 10Stevemunene: druid: Bring druid1013.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182699 (https://phabricator.wikimedia.org/T397441) [12:32:26] (03PS3) 10Stevemunene: druid: Add druid druid101[2-3] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1182700 (https://phabricator.wikimedia.org/T397441) [12:32:27] (03PS2) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) [12:32:30] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: authorize task pods to reach out to the dumps public site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185897 (https://phabricator.wikimedia.org/T402784) (owner: 10Brouberol) [12:32:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T401906)', diff saved to https://phabricator.wikimedia.org/P82701 and previous config saved to /var/cache/conftool/dbconfig/20250908-123232-fceratto.json [12:35:58] btullis@cumin1003 reimage (PID 1586679) is awaiting input [12:37:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.264s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:40:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh3005.wikimedia.org to drbd 
[12:40:15] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:55] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 80.72 ms [12:41:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:41:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:41:47] 06SRE, 10LDAP-Access-Requests: Grant Access to Wmf LDAP group for FRomeo (WMF) - https://phabricator.wikimedia.org/T403960 (10FRomeo_WMF) 03NEW [12:42:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.264s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:42:25] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:42:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:43:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:43:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157286 (10MoritzMuehlenhoff) [12:44:25] RECOVERY - Bird Internet Routing Daemon on doh3005 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:44:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T402925)', diff saved to https://phabricator.wikimedia.org/P82702 and previous config saved to /var/cache/conftool/dbconfig/20250908-124436-ladsgroup.json [12:44:40] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:46:23] (03PS1) 10Btullis: Add cumin aliases to differentiate between the two cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1185916 (https://phabricator.wikimedia.org/T395240) [12:46:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:27] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:47:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P82703 and previous config saved to /var/cache/conftool/dbconfig/20250908-124739-fceratto.json [12:47:50] (03CR) 10Muehlenhoff: [C:03+2] Apply installserver role to install3004 [puppet] - 10https://gerrit.wikimedia.org/r/1185898 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [12:50:56] (03PS1) 10Btullis: Update ceph cookbook to refer to the two specific cephosd clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/1185917 (https://phabricator.wikimedia.org/T395240) [12:51:35] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100*.eqiad.wmnet} and (A:cephosd) [12:52:04] (03PS1) 
10Muehlenhoff: Point webproxy in esams to install3004 [dns] - 10https://gerrit.wikimedia.org/r/1185918 (https://phabricator.wikimedia.org/T402259) [12:54:09] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.696s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:55:38] (03CR) 10Brouberol: [C:03+1] Add cumin aliases to differentiate between the two cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1185916 (https://phabricator.wikimedia.org/T395240) (owner: 10Btullis) [12:55:48] (03CR) 10Brouberol: [C:03+1] Update ceph cookbook to refer to the two specific cephosd clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/1185917 (https://phabricator.wikimedia.org/T395240) (owner: 10Btullis) [12:59:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.162s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:59:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P82705 and previous config saved to /var/cache/conftool/dbconfig/20250908-125943-ladsgroup.json [13:00:04] Urbanecm and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon 
backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1300). [13:00:04] KCVelaga and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] (03CR) 10CI reject: [V:04-1] Update ceph cookbook to refer to the two specific cephosd clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/1185917 (https://phabricator.wikimedia.org/T395240) (owner: 10Btullis) [13:01:09] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:01:15] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [13:01:44] (03PS1) 10Stevemunene: Change all druid_public hosts references to use svc url [puppet] - 10https://gerrit.wikimedia.org/r/1185922 (https://phabricator.wikimedia.org/T397441) [13:02:07] (03PS13) 10Arnaudb: Revert^4 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185888 [13:02:07] (03CR) 10Arnaudb: [C:03+2] "test new mtail config, I will preshot the sanity revert" [puppet] - 10https://gerrit.wikimedia.org/r/1185888 (owner: 10Arnaudb) [13:02:16] o/ [13:02:33] (03PS1) 10Arnaudb: Revert^5 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185923 [13:02:35] (03Abandoned) 10Stevemunene: druid: Change druid host used to run refinery data puge job [puppet] - 10https://gerrit.wikimedia.org/r/1185839 (owner: 10Stevemunene) [13:02:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P82706 and previous config saved to /var/cache/conftool/dbconfig/20250908-130247-fceratto.json [13:03:30] o/ [13:04:05] FIRING: HelmReleaseBadStatus: 
Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:05:58] (03PS1) 10Muehlenhoff: Add dummy keytab for install3004 [labs/private] - 10https://gerrit.wikimedia.org/r/1185924 (https://phabricator.wikimedia.org/T402259) [13:09:24] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add dummy keytab for install3004 [labs/private] - 10https://gerrit.wikimedia.org/r/1185924 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:11:09] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:14:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P82707 and previous config saved to /var/cache/conftool/dbconfig/20250908-131451-ladsgroup.json [13:15:12] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host durum1001.eqiad.wmnet [13:15:48] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host durum2001.codfw.wmnet [13:15:57] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host durum4001.ulsfo.wmnet [13:16:11] (03PS12) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:16:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:17:09] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:55] !log fceratto@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1200 (T401906)', diff saved to https://phabricator.wikimedia.org/P82708 and previous config saved to /var/cache/conftool/dbconfig/20250908-131755-fceratto.json [13:17:59] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:18:11] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [13:18:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T401906)', diff saved to https://phabricator.wikimedia.org/P82709 and previous config saved to /var/cache/conftool/dbconfig/20250908-131818-fceratto.json [13:18:36] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11157462 (10Jclark-ctr) @btullis that is my mistake they are not NVME So legacy could work for this. [13:19:07] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum1001.eqiad.wmnet [13:19:10] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:19:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:32] ^ expected [13:19:48] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum2001.codfw.wmnet [13:20:01] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1185922 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [13:20:11] gerrit2003 is me (expected) [13:20:20] 
!log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum4001.ulsfo.wmnet [13:20:33] (03CR) 10Ayounsi: [C:03+1] Point webproxy in esams to install3004 [dns] - 10https://gerrit.wikimedia.org/r/1185918 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:20:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T401906)', diff saved to https://phabricator.wikimedia.org/P82710 and previous config saved to /var/cache/conftool/dbconfig/20250908-132044-fceratto.json [13:21:52] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum3004.esams.wmnet [13:22:17] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11157498 (10elukey) @klausman I am going to check the provision cookbook, it may be related to new code paths that raise these problems for ml hosts. Thanks for the report! [13:24:10] RESOLVED: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:25:01] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:25:21] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:10] (03CR) 10Papaul: [C:03+2] Add eqsin private IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (owner: 10Papaul) [13:26:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.346 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:26:37] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:26:47] (03CR) 10Papaul: [C:03+2] Adding BGP to mr1-eqsin, cr2/3-eqsin to replace OSPF [homer/public] - 
10https://gerrit.wikimedia.org/r/1185112 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [13:28:46] (03Merged) 10jenkins-bot: Adding BGP to mr1-eqsin, cr2/3-eqsin to replace OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/1185112 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [13:28:47] (03Merged) 10jenkins-bot: Add eqsin private IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (owner: 10Papaul) [13:29:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T402925)', diff saved to https://phabricator.wikimedia.org/P82711 and previous config saved to /var/cache/conftool/dbconfig/20250908-132958-ladsgroup.json [13:30:03] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:30:15] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1195.eqiad.wmnet with reason: Maintenance [13:30:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T402925)', diff saved to https://phabricator.wikimedia.org/P82712 and previous config saved to /var/cache/conftool/dbconfig/20250908-133021-ladsgroup.json [13:30:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum3004.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:31:59] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1233.eqiad.wmnet with OS bullseye [13:32:01] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:33:28] jmm@cumin2002 decommission (PID 741808) is awaiting input [13:33:35] (03PS2) 10Stevemunene: Change all druid_public hosts references to use svc url [puppet] - 10https://gerrit.wikimedia.org/r/1185922 
(https://phabricator.wikimedia.org/T397441) [13:33:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum3004.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:33:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:33:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum3004.esams.wmnet [13:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157541 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum3004.esams.wmnet` - durum3004.esams.wmnet (**PASS**)... 
[13:34:43] (03PS1) 10Andrew Bogott: eqiad1 cloudceph version pacific -> quincy [puppet] - 10https://gerrit.wikimedia.org/r/1185937 (https://phabricator.wikimedia.org/T402190) [13:35:23] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:35:28] (03CR) 10Btullis: [C:03+2] Add cumin aliases to differentiate between the two cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1185916 (https://phabricator.wikimedia.org/T395240) (owner: 10Btullis) [13:35:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P82713 and previous config saved to /var/cache/conftool/dbconfig/20250908-133552-fceratto.json [13:36:13] (03CR) 10Arnaudb: [C:03+2] Revert^5 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1185923 (owner: 10Arnaudb) [13:37:04] !incidents [13:37:05] No incidents occurred in the past 24 hours for team SRE [13:38:04] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11157551 (10elukey) I briefly checked in the ml-serve1009's BIOS settings, and there is no key with "LAN" inside, hence the cookbook fails because the nic list to set is empty.... 
[13:38:04] (03CR) 10Andrew Bogott: [C:03+2] eqiad1 cloudceph version pacific -> quincy [puppet] - 10https://gerrit.wikimedia.org/r/1185937 (https://phabricator.wikimedia.org/T402190) (owner: 10Andrew Bogott) [13:38:16] (03PS1) 10Muehlenhoff: Apply config to enable new Bird release on the role/esams level [puppet] - 10https://gerrit.wikimedia.org/r/1185938 (https://phabricator.wikimedia.org/T402259) [13:39:05] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh3004.wikimedia.org [13:39:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [13:39:14] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11157560 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [13:40:09] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:51] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:43:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:12] (03PS2) 10Btullis: Update ceph cookbook to refer to the two specific cephosd clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/1185917 (https://phabricator.wikimedia.org/T395240) [13:44:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1233.eqiad.wmnet with OS bullseye [13:46:08] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire 
- https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:22] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:49:53] (03PS1) 10Muehlenhoff: Update DHCP server in esams [homer/public] - 10https://gerrit.wikimedia.org/r/1185940 (https://phabricator.wikimedia.org/T402259) [13:50:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh3004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:50:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh3004.wikimedia.org [13:50:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11157614 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `doh3004.wikimedia.org` - doh3004.wikimedia.org (**PASS**)... 
[13:51:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P82714 and previous config saved to /var/cache/conftool/dbconfig/20250908-135100-fceratto.json [13:51:08] (03PS8) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [13:51:10] (03PS1) 10Muehlenhoff: Update DHCP server in esams [puppet] - 10https://gerrit.wikimedia.org/r/1185941 (https://phabricator.wikimedia.org/T402259) [13:51:32] (03PS2) 10Muehlenhoff: Update DHCP server in esams [puppet] - 10https://gerrit.wikimedia.org/r/1185941 (https://phabricator.wikimedia.org/T402259) [13:51:52] (03CR) 10Btullis: [C:03+2] Update ceph cookbook to refer to the two specific cephosd clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/1185917 (https://phabricator.wikimedia.org/T395240) (owner: 10Btullis) [13:52:30] (03PS2) 10Muehlenhoff: Apply config to enable new Bird release on the role/esams level [puppet] - 10https://gerrit.wikimedia.org/r/1185938 (https://phabricator.wikimedia.org/T402259) [13:52:48] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:54:02] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:54:12] (03PS2) 10Daimona Eaytoy: tables-catalog: Document ce_event_contributions (CampaignEvents) [puppet] - 10https://gerrit.wikimedia.org/r/1184501 (https://phabricator.wikimedia.org/T400719) [13:54:16] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Document ce_event_contributions (CampaignEvents) [puppet] - 10https://gerrit.wikimedia.org/r/1184501 (https://phabricator.wikimedia.org/T400719) (owner: 10Daimona Eaytoy) [13:55:42] (03CR) 10Muehlenhoff: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1185938 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:56:57] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11157630 (10MoritzMuehlenhoff) [13:57:13] KCVelaga: so sorry no one handled the window yet! [13:57:19] let's do the deployment, we should have time. [13:57:35] also anzx if still around [13:57:46] (03PS3) 10KCVelaga: Disable User Agent collection for MinT for Readers streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) [13:58:15] actually, anzx's just a cleanup, i'll do that anyway [13:58:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) (owner: 10KCVelaga) [13:58:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182541 (https://phabricator.wikimedia.org/T376049) (owner: 10Anzx) [13:58:40] urbanecm: ty [13:58:48] urbanecm thanks! 
[13:59:33] ops-codfw, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11157661 (Jgreen)
[13:59:37] (Merged) jenkins-bot: Disable User Agent collection for MinT for Readers streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) (owner: KCVelaga)
[13:59:39] (Merged) jenkins-bot: hawiki: remove temporary logo files [mediawiki-config] - https://gerrit.wikimedia.org/r/1182541 (https://phabricator.wikimedia.org/T376049) (owner: Anzx)
[13:59:46] (CR) Scott French: [C:+1] envoy-future: Update to v1.29.12 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1185232 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[13:59:47] (CR) Fabfur: [C:+2] team-traffic: removed haproxykafka critical alert [alerts] - https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) (owner: Fabfur)
[14:00:23] (PS1) Andrew Bogott: Update spec tests to expect ceph version 'quincy' [puppet] - https://gerrit.wikimedia.org/r/1185943 (https://phabricator.wikimedia.org/T402190)
[14:00:35] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1182621|Disable User Agent collection for MinT for Readers streams (T398057)]], [[gerrit:1182541|hawiki: remove temporary logo files (T376049)]]
[14:00:40] T398057: Opt-out of User Agent collection for MinT for Readers stream - https://phabricator.wikimedia.org/T398057
[14:00:40] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049
[14:00:50] (Merged) jenkins-bot: Update ceph cookbook to refer to the two specific cephosd clusters [cookbooks] - https://gerrit.wikimedia.org/r/1185917 (https://phabricator.wikimedia.org/T395240) (owner: Btullis)
[14:01:01] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:01:02] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:02:41] (CR) Btullis: [C:+1] Change all druid_public hosts references to use svc url [puppet] - https://gerrit.wikimedia.org/r/1185922 (https://phabricator.wikimedia.org/T397441) (owner: Stevemunene)
[14:04:14] (PS1) Scott French: mediawiki: allow kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - https://gerrit.wikimedia.org/r/1185944 (https://phabricator.wikimedia.org/T401425)
[14:04:42] (CR) Andrew Bogott: [C:+2] Update spec tests to expect ceph version 'quincy' [puppet] - https://gerrit.wikimedia.org/r/1185943 (https://phabricator.wikimedia.org/T402190) (owner: Andrew Bogott)
[14:04:56] (CR) Clément Goubert: [C:+1] mediawiki: allow kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - https://gerrit.wikimedia.org/r/1185944 (https://phabricator.wikimedia.org/T401425) (owner: Scott French)
[14:05:51] jclark@cumin1002 reimage (PID 3906730) is awaiting input
[14:06:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T401906)', diff saved to https://phabricator.wikimedia.org/P82715 and previous config saved to /var/cache/conftool/dbconfig/20250908-140607-fceratto.json
[14:06:12] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:06:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T402925)', diff saved to https://phabricator.wikimedia.org/P82716 and previous config saved to /var/cache/conftool/dbconfig/20250908-140622-ladsgroup.json
[14:06:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[14:06:26] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[14:06:42] (CR) Scott French: [C:+2] mediawiki: allow kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - https://gerrit.wikimedia.org/r/1185944 (https://phabricator.wikimedia.org/T401425) (owner: Scott French)
[14:06:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[14:07:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T401906)', diff saved to https://phabricator.wikimedia.org/P82717 and previous config saved to /var/cache/conftool/dbconfig/20250908-140705-fceratto.json
[14:08:11] (Merged) jenkins-bot: mediawiki: allow kube-dumps in MediaWikiMemcachedHighErrorRate [alerts] - https://gerrit.wikimedia.org/r/1185944 (https://phabricator.wikimedia.org/T401425) (owner: Scott French)
[14:08:45] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1233.eqiad.wmnet with reason: host reimage
[14:09:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:09:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T401906)', diff saved to https://phabricator.wikimedia.org/P82718 and previous config saved to /var/cache/conftool/dbconfig/20250908-140932-fceratto.json
[14:10:39] (PS2) Ebernhardson: cirrus: Reduce galleries weight in search on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590)
[14:10:39] (CR) Ebernhardson: cirrus: Reduce galleries weight in search on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[14:13:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1233.eqiad.wmnet with reason: host reimage
[14:15:14] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host durum1002.eqiad.wmnet
[14:15:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[14:15:54] (CR) Arnaudb: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1185943 (https://phabricator.wikimedia.org/T402190) (owner: Andrew Bogott)
[14:17:42] (CR) DCausse: cirrus: Reduce galleries weight in search on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[14:18:28] the build is very quick...
[14:19:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:19:11] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum1002.eqiad.wmnet
[14:19:44] urbanecm: that can be "fixed" easily
[14:20:37] taavi: by making the deployment a no-op? or something more clever?
[14:21:13] if it being very quick is a problem, it's easy to make it much slower :D
[14:21:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P82719 and previous config saved to /var/cache/conftool/dbconfig/20250908-142129-ladsgroup.json
[14:22:03] (PS1) Jforrester: abstractwiki-rust-web: Bump version to 1.85, rustc-web upgraded over the weekend [docker-images/production-images] - https://gerrit.wikimedia.org/r/1185948
[14:22:26] taavi: ah, i was being sarcastic :D i started scap 22 minutes ago...
[14:22:37] and it's still not on mwdebug
[14:23:05] First deployment of the week after the sunday rebuild if I had to guess
[14:24:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:24:14] likely, but hard to predict when that happens
[14:24:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P82720 and previous config saved to /var/cache/conftool/dbconfig/20250908-142439-fceratto.json
[14:24:41] SRE, Infrastructure-Foundations: offboard-user: Check for use of email address of user to be offboarded across Puppet repo - https://phabricator.wikimedia.org/T403452#11157814 (MoritzMuehlenhoff) p:Triage→Medium
[14:24:49] (CR) Thiemo Kreuz (WMDE): [C:+1] InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName [mediawiki-config] - https://gerrit.wikimedia.org/r/1170736 (https://phabricator.wikimedia.org/T362771) (owner: Reedy)
[14:25:20] _finally_
[14:25:45] (CR) Scott French: "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1184942 (owner: Scott French)
[14:25:53] (CR) Scott French: [C:+2] P:rsyslog::kafka_shipper: configure output lookup for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184942 (owner: Scott French)
[14:26:59] (CR) Ayounsi: [C:+1] Update DHCP server in esams [puppet] - https://gerrit.wikimedia.org/r/1185941 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[14:27:32] (CR) Ayounsi: [C:+1] Update DHCP server in esams [homer/public] - https://gerrit.wikimedia.org/r/1185940 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[14:29:03] (PS9) Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1430)
[14:30:35] (CR) CI reject: [V:-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[14:30:39] SRE, Infrastructure-Foundations, netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#11157827 (cmooney) I have set the bandwidth to '6000000000' either side manually in the UI so let's see how it goes.
[14:31:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1233.eqiad.wmnet with OS bullseye
[14:32:19] !log urbanecm@deploy1003 kcvelaga, urbanecm, anzx: Backport for [[gerrit:1182621|Disable User Agent collection for MinT for Readers streams (T398057)]], [[gerrit:1182541|hawiki: remove temporary logo files (T376049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:32:24] T398057: Opt-out of User Agent collection for MinT for Readers stream - https://phabricator.wikimedia.org/T398057
[14:32:24] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049
[14:33:10] !log urbanecm@deploy1003 kcvelaga, urbanecm, anzx: Continuing with sync
[14:33:33] KCVelaga and i tested, seems like API produces expected results https://usercontent.irccloud-cdn.com/file/CNuhcL80/image.png
[14:33:34] proceeding
[14:36:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P82721 and previous config saved to /var/cache/conftool/dbconfig/20250908-143637-ladsgroup.json
[14:37:31] (PS48) CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[14:39:06] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[14:39:27] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[14:39:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P82722 and previous config saved to /var/cache/conftool/dbconfig/20250908-143947-fceratto.json
[14:43:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[14:43:05] (CR) Ssingh: dnsrecursor: add recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[14:43:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T402925)', diff saved to https://phabricator.wikimedia.org/P82723 and previous config saved to /var/cache/conftool/dbconfig/20250908-144309-ladsgroup.json
[14:43:13] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[14:43:16] SRE-tools, Infrastructure-Foundations, Spicerack: Puppet module hiera_lookup not working - https://phabricator.wikimedia.org/T378331#11157913 (jhathaway) a:jhathaway
[14:44:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:45:54] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182621|Disable User Agent collection for MinT for Readers streams (T398057)]], [[gerrit:1182541|hawiki: remove temporary logo files (T376049)]] (duration: 45m 20s)
[14:46:00] finally
[14:46:03] T398057: Opt-out of User Agent collection for MinT for Readers stream - https://phabricator.wikimedia.org/T398057
[14:46:03] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049
[14:46:32] (PS1) Scott French: shellbox: update to 2025-08-29-172844 image [deployment-charts] - https://gerrit.wikimedia.org/r/1185961 (https://phabricator.wikimedia.org/T403284)
[14:46:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:47:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[14:47:41] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11157948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm executed with errors: - dse-k8s-worker10...
[14:48:26] (CR) Elukey: [C:+2] redfish: support weak Etag values in change_user_password [software/spicerack] - https://gerrit.wikimedia.org/r/1185877 (https://phabricator.wikimedia.org/T392851) (owner: Elukey)
[14:49:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.309 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:50:27] (CR) Clément Goubert: [C:+1] "Images exist, LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1185961 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[14:51:03] (CR) CDobbins: [C:+2] dnsrecursor: add recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[14:51:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.583 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:51:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T402925)', diff saved to https://phabricator.wikimedia.org/P82724 and previous config saved to /var/cache/conftool/dbconfig/20250908-145144-ladsgroup.json
[14:51:49] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[14:51:50] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[14:52:08] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:52:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T402925)', diff saved to https://phabricator.wikimedia.org/P82725 and previous config saved to /var/cache/conftool/dbconfig/20250908-145215-ladsgroup.json
[14:54:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T401906)', diff saved to https://phabricator.wikimedia.org/P82726 and previous config saved to /var/cache/conftool/dbconfig/20250908-145454-fceratto.json
[14:54:59] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:55:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[14:55:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[14:56:27] ops-codfw, DC-Ops, decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11158010 (Jgreen)
[14:57:19] SRE-tools, Infrastructure-Foundations, Spicerack: Puppet module hiera_lookup not working - https://phabricator.wikimedia.org/T378331#11158026 (Volans) It works fine for me: ` >>> p.hiera_lookup('cumin1003.eqiad.wmnet', 'profile::puppet::agent::force_puppet7') DRY-RUN: Executing commands ['puppet look...
[14:57:55] (Merged) jenkins-bot: redfish: support weak Etag values in change_user_password [software/spicerack] - https://gerrit.wikimedia.org/r/1185877 (https://phabricator.wikimedia.org/T392851) (owner: Elukey)
[14:58:02] SRE, Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11158035 (MoritzMuehlenhoff)
[14:58:26] ops-codfw, SRE, DC-Ops, decommission-hardware, serviceops: decommission mwmaint2002.codfw.wmnet - https://phabricator.wikimedia.org/T403855#11158041 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:58:56] SRE-tools, Infrastructure-Foundations, Spicerack: Puppet module hiera_lookup not working - https://phabricator.wikimedia.org/T378331#11158044 (jhathaway) >>! In T378331#11158026, @Volans wrote: > It works fine for me: ok to close then?
[15:04:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:24] SRE-tools, Infrastructure-Foundations, Spicerack: Puppet module hiera_lookup not working - https://phabricator.wikimedia.org/T378331#11158104 (Volans) Open→Resolved This might have been related to the migration to puppet7 and the new puppetdb hosts probably. I can't recall. Resolving as i...
[15:08:25] SRE, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11158130 (MoritzMuehlenhoff)
[15:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:28] ops-codfw, DC-Ops, decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11158149 (Jgreen)
[15:09:29] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11158150 (Jgreen)
[15:09:51] ops-codfw, DC-Ops, decommission-hardware: decommission frmx2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403970#11158151 (Jgreen)
[15:10:16] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11158154 (Jgreen)
[15:14:02] (CR) Elukey: [C:+2] profile::amd_gpu: support the new AMD GPU k8s plugin [puppet] - https://gerrit.wikimedia.org/r/1185865 (https://phabricator.wikimedia.org/T398600) (owner: Elukey)
[15:15:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100*.eqiad.wmnet} and (A:cephosd)
[15:15:49] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11158190 (MoritzMuehlenhoff)
[15:18:51] (PS1) DLynch: Move ve.track.js into a separate module [VisualEditor/VisualEditor] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1185970 (https://phabricator.wikimedia.org/T403745)
[15:19:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T402925)', diff saved to https://phabricator.wikimedia.org/P82727 and previous config saved to /var/cache/conftool/dbconfig/20250908-151926-ladsgroup.json
[15:19:32] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[15:19:59] SRE, Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11158235 (MoritzMuehlenhoff)
[15:20:16] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[15:20:35] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[15:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:26:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T402925)', diff saved to https://phabricator.wikimedia.org/P82728 and previous config saved to /var/cache/conftool/dbconfig/20250908-152626-ladsgroup.json
[15:26:31] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[15:26:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.491 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:28:24] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1185971 (https://phabricator.wikimedia.org/T128546)
[15:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1530). Please do the needful.
[15:30:46] o/ - proceeding with portals update
[15:30:55] (CR) Jdrewniak: [C:+2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1185971 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:32:37] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1185971 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82729 and previous config saved to /var/cache/conftool/dbconfig/20250908-153434-ladsgroup.json
[15:34:55] (PS13) Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[15:41:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P82730 and previous config saved to /var/cache/conftool/dbconfig/20250908-154134-ladsgroup.json
[15:42:44] (PS1) MVernon: sretest2010: set to be installed like a new ms-be* node [puppet] - https://gerrit.wikimedia.org/r/1185973 (https://phabricator.wikimedia.org/T394357)
[15:43:56] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/1185973 (https://phabricator.wikimedia.org/T394357) (owner: MVernon)
[15:48:43] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[15:49:06] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[15:49:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P82731 and previous config saved to /var/cache/conftool/dbconfig/20250908-154942-ladsgroup.json
[15:52:31] (CR) CI reject: [V:-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[15:54:35] (PS1) Elukey: sre.hosts.provision: expand Supermicro models with no PXE devs in BIOS [cookbooks] - https://gerrit.wikimedia.org/r/1185975 (https://phabricator.wikimedia.org/T401964)
[15:56:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P82732 and previous config saved to /var/cache/conftool/dbconfig/20250908-155641-ladsgroup.json
[15:56:51] (CR) Elukey: "@tklausmann@wikimedia.org: we could ideally test this with 'test-cookbook' on a depooled ml-serve host or a ml-lab one." [cookbooks] - https://gerrit.wikimedia.org/r/1185975 (https://phabricator.wikimedia.org/T401964) (owner: Elukey)
[15:58:13] (PS1) Ahmon Dancy: buildkitd: Bump to v0.24.0 [puppet] - https://gerrit.wikimedia.org/r/1185976 (https://phabricator.wikimedia.org/T403625)
[15:59:37] jouncebot: nowandnext
[15:59:37] For the next 0 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1530)
[15:59:37] In 1 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1700)
[15:59:37] In 1 hour(s) and 0 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1700)
[15:59:46] deploying some envoy upgrades
[16:00:02] (CR) Eevans: [C:+1] sretest2010: set to be installed like a new ms-be* node [puppet] - https://gerrit.wikimedia.org/r/1185973 (https://phabricator.wikimedia.org/T394357) (owner: MVernon)
[16:01:46] (CR) Dzahn: [C:+1] "we have to trust it's an unused setting. maybe we can deploy it during next phab deploy window? https://phabricator.wikimedia.org/T403948" [puppet] - https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: Aklapper)
[16:02:09] (CR) RLazarus: [C:+2] mw-videoscaler: Upgrade to envoy 1.26.8 [deployment-charts] - https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[16:03:49] (Merged) jenkins-bot: mw-videoscaler: Upgrade to envoy 1.26.8 [deployment-charts] - https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[16:04:43] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1185868| Bumping portals to master (T128546)]] (duration: 15m 14s)
[16:04:47] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:04:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T402925)', diff saved to https://phabricator.wikimedia.org/P82733 and previous config saved to /var/cache/conftool/dbconfig/20250908-160449-ladsgroup.json
[16:04:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[16:05:06] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[16:05:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T402925)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250908-160512-ladsgroup.json
[16:06:35] (CR) Dzahn: [C:+2] buildkitd: Bump to v0.24.0 [puppet] - https://gerrit.wikimedia.org/r/1185976 (https://phabricator.wikimedia.org/T403625) (owner: Ahmon Dancy)
[16:06:45] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1185868| Bumping portals to master (T128546)]] (duration: 02m 00s)
[16:10:21] (PS1) Elukey: sre.host.provision: move WebServer.1#HostHeaderCheck as optional [cookbooks] - https://gerrit.wikimedia.org/r/1185978 (https://phabricator.wikimedia.org/T401964)
[16:11:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T402925)', diff saved to https://phabricator.wikimedia.org/P82734 and previous config saved to /var/cache/conftool/dbconfig/20250908-161149-ladsgroup.json
[16:11:53] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[16:12:05] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[16:12:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T402925)', diff saved to https://phabricator.wikimedia.org/P82735 and previous config saved to /var/cache/conftool/dbconfig/20250908-161212-ladsgroup.json
[16:12:48] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[16:12:56] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[16:13:49] ops-eqiad, SRE, DC-Ops, Patch-For-Review: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11158650 (elukey) I filed two code reviews to hopefully fix the provisioning for all nodes, but afaics from a search on Phabricator these ML nodes are sp...
[16:16:30] !log vriley@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:16:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1057.eqiad.wmnet with OS bookworm
[16:16:37] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158668 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1057.eqiad.wmnet with OS bookworm completed: - es1057 (**WARN**) -...
[16:18:16] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158671 (VRiley-WMF)
[16:21:48] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye
[16:22:08] SRE, FR-donorrelations: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986 (AMJohnson) NEW
[16:22:14] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1054.eqiad.wmnet with OS bookworm
[16:22:21] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158718 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm
[16:24:45] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm
[16:24:57] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158729 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F...
[16:25:19] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [16:25:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158730 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [16:30:34] (03CR) 10Andrea Denisse: "Thanks Keith, I added documentation on how to use the webhook on Wikitech: https://wikitech.wikimedia.org/wiki/Alertmanager#Sending_alerts" [puppet] - 10https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: 10Andrea Denisse) [16:35:33] (03PS1) 10RLazarus: kubernetes: Set default Envoy version to 1.26.8 [puppet] - 10https://gerrit.wikimedia.org/r/1185984 (https://phabricator.wikimedia.org/T402854) [16:40:27] !log Upgrade envoyproxy on grafana2001 - T402584 [16:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:31] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [16:41:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T402925)', diff saved to https://phabricator.wikimedia.org/P82736 and previous config saved to /var/cache/conftool/dbconfig/20250908-164111-ladsgroup.json [16:41:15] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:41:21] !log Upgrade envoyproxy on grafana1002 - T402584 [16:41:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T402925)', diff saved to https://phabricator.wikimedia.org/P82737 and previous config saved to /var/cache/conftool/dbconfig/20250908-164121-ladsgroup.json [16:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:24] !log Upgrade envoyproxy on prometheus1005 - T402584 [16:43:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:14] vriley@cumin1003 reimage (PID 1652506) is awaiting input [16:46:15] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:46:33] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:52:06] (03PS1) 10Esanders: Enable DT thanks at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185991 (https://phabricator.wikimedia.org/T400849) [16:53:01] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185992 [16:53:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185991 (https://phabricator.wikimedia.org/T400849) (owner: 10Esanders) [16:56:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82738 and previous config saved to /var/cache/conftool/dbconfig/20250908-165619-ladsgroup.json [16:56:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P82739 and previous config saved to /var/cache/conftool/dbconfig/20250908-165629-ladsgroup.json [16:56:37] (03CR) 10Scott French: [C:03+1] kubernetes: Set default Envoy version to 1.26.8 [puppet] - 10https://gerrit.wikimedia.org/r/1185984 (https://phabricator.wikimedia.org/T402854) (owner: 10RLazarus) [16:56:51] 
(03CR) 10RLazarus: [C:03+2] kubernetes: Set default Envoy version to 1.26.8 [puppet] - 10https://gerrit.wikimedia.org/r/1185984 (https://phabricator.wikimedia.org/T402854) (owner: 10RLazarus) [16:58:29] (03PS2) 10RLazarus: mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101) [16:58:56] (03PS1) 10RLazarus: cleanup: Remove Envoy 1.26.8 overrides now that it's the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) [16:59:55] (03CR) 10Scott French: [C:03+2] shellbox: update to 2025-08-29-172844 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185961 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1700). [17:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1700). 
[17:00:11] o/ [17:01:53] (03Merged) 10jenkins-bot: shellbox: update to 2025-08-29-172844 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185961 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:02:31] I'll be getting started on some shellbox updates shortly [17:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:19] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1054.eqiad.wmnet with OS bookworm [17:09:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158975 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm executed with errors: - es1054 (**F... 
[17:09:55] * swfrench-wmf is getting started on shellbox updates [17:10:15] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:10:17] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1054.eqiad.wmnet with OS bookworm [17:10:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11158980 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm [17:10:43] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:10:45] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:10:57] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:10:58] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:11:10] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:11:11] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:11:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P82740 and previous config saved to /var/cache/conftool/dbconfig/20250908-171126-ladsgroup.json [17:11:29] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:11:30] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:11:36] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.679 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P82741 and previous 
config saved to /var/cache/conftool/dbconfig/20250908-171136-ladsgroup.json [17:11:46] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:11:48] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:12:09] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:13:45] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:14:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bookworm [17:14:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:32] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:17:18] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:17:22] !log Upgrade envoyproxy on prometheus[1006-1008] and [2005-2008] - T402584 [17:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:26] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:17:49] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:18:05] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:18:37] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:18:52] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:19:24] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:19:43] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:19:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 
OK - 9235 bytes in 7.801 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:20:15] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:20:42] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:21:14] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:21:48] !log Upgrade envoyproxy on prometheus::pop hosts - T402584 [17:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:03] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:23:27] !log updated all shellbox services to 2025-08-29-172844 (+ envoy 1.26.8-1) in codfw - T403284 [17:23:27] !log Upgrade envoyproxy on titan1001 - T402584 [17:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:30] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [17:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:35] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:26:19] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm [17:26:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159024 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... 
[17:26:25] !log Upgrade envoyproxy on titan hosts - T402584 [17:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T402925)', diff saved to https://phabricator.wikimedia.org/P82742 and previous config saved to /var/cache/conftool/dbconfig/20250908-172634-ladsgroup.json [17:26:38] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [17:26:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T402925)', diff saved to https://phabricator.wikimedia.org/P82743 and previous config saved to /var/cache/conftool/dbconfig/20250908-172644-ladsgroup.json [17:26:50] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [17:26:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T402925)', diff saved to https://phabricator.wikimedia.org/P82744 and previous config saved to /var/cache/conftool/dbconfig/20250908-172657-ladsgroup.json [17:26:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [17:27:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T402925)', diff saved to https://phabricator.wikimedia.org/P82745 and previous config saved to /var/cache/conftool/dbconfig/20250908-172706-ladsgroup.json [17:28:13] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:28:42] !log Upgrade envoyproxy on graphite hosts - T402584 [17:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:46] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:28:51] (03CR) 
10Herron: [C:03+1] "Awesome thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: 10Andrea Denisse) [17:32:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:32:15] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:32:59] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:33:30] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:33:45] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:16] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:34:32] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:35:03] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:35:04] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bookworm [17:35:07] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [17:35:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 
for host es1056.eqiad.wmnet with OS bookworm [17:35:22] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:35:30] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:35:53] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:36:19] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:36:50] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:36:58] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1054.eqiad.wmnet with reason: host reimage [17:37:31] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:38:19] !log updated all shellbox services to 2025-08-29-172844 (+ envoy 1.26.8-1) in eqiad - T403284 [17:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:23] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [17:42:41] (03PS1) 10Jasmine: wikikube: decom control plane wikikube-ctrl1001 [puppet] - 10https://gerrit.wikimedia.org/r/1186006 (https://phabricator.wikimedia.org/T383227) [17:43:16] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1054.eqiad.wmnet with reason: host reimage [17:43:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159134 (10VRiley-WMF) [17:43:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:39] (03CR) 10Dzahn: [C:03+2] cache::text: set apt-staging to NOT cache [puppet] - 
10https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284) (owner: 10Dzahn) [17:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:54] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11159151 (10Aklapper) [17:51:15] (03PS1) 10Andrew Bogott: cloudcephmon1004: move to ceph version 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1186008 (https://phabricator.wikimedia.org/T402190) [17:51:20] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:51:25] Jeff_Green: frdata2001 [17:51:40] and frmx2001 -- can you confirm they have been decommissioned? [17:51:42] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bookworm [17:51:54] context: pending DNS changes on the netbox cookbook that can block other changes [17:51:58] (03CR) 10Andrew Bogott: [C:03+2] cloudcephmon1004: move to ceph version 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1186008 (https://phabricator.wikimedia.org/T402190) (owner: 10Andrew Bogott) [17:52:00] https://netbox.wikimedia.org/extras/changelog/240870/ [17:52:07] 06SRE, 06Traffic, 13Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11159159 (10Dzahn) Deployed! And we spot checked it on cp1011. It isn't caching anymore now. [17:52:12] https://netbox.wikimedia.org/extras/changelog/240875/ [17:52:23] both confirm decommissioning status but I wanted to check. thanks. 
[17:52:58] sukhe: I started the decom tasks for them today [17:53:12] ok thank you, I will run the Netbox cookbook then [17:53:33] sukhe: ok thx, sorry for any confusion [17:53:33] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:53:36] can confirm frdata2001 not in icinga anymore [17:53:44] Jeff_Green: no worries at all [17:53:47] but probably still in puppetdb [17:53:49] mutante: yep, thanks [17:57:23] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove frdata2001 and frmx2001 - sukhe@cumin1003" [17:57:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove frdata2001 and frmx2001 - sukhe@cumin1003" [17:57:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:35] jouncebot: nowandnext [17:57:35] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T1700) [17:57:35] In 2 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T2000) [17:57:46] I'll helmfile-only deploy for https://gerrit.wikimedia.org/r/1184893 [17:57:56] (03CR) 10RLazarus: [C:03+2] mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [17:59:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T402925)', diff saved to https://phabricator.wikimedia.org/P82746 and previous config saved to /var/cache/conftool/dbconfig/20250908-175931-ladsgroup.json [17:59:36] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [17:59:49] (03CR) 10Herron: [C:03+2] "as discussed on irc, going to merge this and 
monitor thanos / thanos-swift utilization" [puppet] - 10https://gerrit.wikimedia.org/r/1184566 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [18:00:34] (03Merged) 10jenkins-bot: mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [18:00:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [18:00:49] (03CR) 10Ssingh: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1185938 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [18:01:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T402925)', diff saved to https://phabricator.wikimedia.org/P82747 and previous config saved to /var/cache/conftool/dbconfig/20250908-180131-ladsgroup.json [18:02:36] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:04:25] !log rzl@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [18:04:41] !log rzl@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [18:04:43] (03PS1) 10Andrew Bogott: pin cloudcephosd1004 to version 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1186011 (https://phabricator.wikimedia.org/T402190) [18:04:49] !log rzl@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply [18:05:12] !log rzl@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply [18:05:15] (03PS1) 10Scott French: shellbox-syntaxhighlight: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186009 
(https://phabricator.wikimedia.org/T403284) [18:05:41] vriley@cumin1003 reimage (PID 1656553) is awaiting input [18:05:43] (03PS2) 10Andrew Bogott: pin cloudcephmon1004 to version 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1186011 (https://phabricator.wikimedia.org/T402190) [18:05:57] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:06:43] (03PS3) 10Andrew Bogott: pin cloudcephmon1004 to version 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1186011 (https://phabricator.wikimedia.org/T402190) [18:06:56] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [18:06:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1054.eqiad.wmnet with OS bookworm [18:07:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1054.eqiad.wmnet with OS bookworm completed: - es1054 (**PASS**) -... [18:08:19] (03CR) 10Andrew Bogott: [C:03+2] pin cloudcephmon1004 to version 'pacific' [puppet] - 10https://gerrit.wikimedia.org/r/1186011 (https://phabricator.wikimedia.org/T402190) (owner: 10Andrew Bogott) [18:08:46] 06SRE, 06Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11159215 (10Papaul) BGP is up on mr1-eqsin cr2/3-eqsin ` mr1-eqsin# run show route protocol ospf inet.0: 198 destinations, 200 routes (198 active, 0 holddown, 0 hidden) Res... [18:10:09] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:11:02] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:12:50] (03CR) 10Aklapper: "Sure, can do. 
That deploy window is likely gonna be Sep16, skipping Sep09" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [18:14:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P82748 and previous config saved to /var/cache/conftool/dbconfig/20250908-181439-ladsgroup.json [18:14:55] (03PS1) 10Clare Ming: xLab: Deploy v1.0.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186017 [18:16:21] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:16:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82749 and previous config saved to /var/cache/conftool/dbconfig/20250908-181640-ladsgroup.json [18:21:06] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [18:22:50] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm [18:23:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... 
[18:24:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [18:26:02] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [18:26:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159248 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [18:26:16] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill citoid Pyrra Metrics - https://phabricator.wikimedia.org/T400073#11159249 (10herron) p:05Triage→03Medium [18:26:33] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v1.0.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186017 (owner: 10Clare Ming) [18:28:40] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill citoid Pyrra Metrics - https://phabricator.wikimedia.org/T400073#11159252 (10herron) 05Open→03Resolved 6 weeks worth of metrics have been backfilled please disregard the small gap on 9/2 where the backfill period ends and current metrics begin [18:28:43] (03Merged) 10jenkins-bot: xLab: Deploy v1.0.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186017 (owner: 10Clare Ming) [18:29:11] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1184893 and https://gerrit.wikimedia.org/r/1185984 [18:29:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P82750 and previous config saved to /var/cache/conftool/dbconfig/20250908-182946-ladsgroup.json [18:31:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P82751 and previous config saved to /var/cache/conftool/dbconfig/20250908-183147-ladsgroup.json [18:31:55] 
(03PS1) 10Herron: pyrra: tonecheck: bump revision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1186022 [18:35:27] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1184893 and https://gerrit.wikimedia.org/r/1185984 (duration: 14m 34s) [18:35:42] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:36:20] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [18:36:47] (03PS1) 10Dzahn: zuul: factor webserver/proxy out into its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1186023 (https://phabricator.wikimedia.org/T401614) [18:37:10] (03PS2) 10Herron: pyrra: tonecheck: bump revision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1186022 [18:37:23] (03PS2) 10Dzahn: zuul: factor webserver/proxy out into its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1186023 (https://phabricator.wikimedia.org/T395938) [18:38:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11159281 (10KFrancis) Hi @mahmoud.abdelsattar.wmde I'm waiting on legal counsel to counter sign the agreement. I... 
[18:38:47] (03PS3) 10Dzahn: zuul: factor webserver/proxy out into its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1186023 (https://phabricator.wikimedia.org/T395938) [18:39:46] (03PS3) 10Herron: pyrra: tonecheck: bump revision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1186022 (https://phabricator.wikimedia.org/T400071) [18:43:14] (03PS6) 10Bking: opensearch-operator: create helmfile for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [18:43:40] (03PS1) 10Dzahn: admin: upgrade musikanimal from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1186024 (https://phabricator.wikimedia.org/T403868) [18:44:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bookworm [18:44:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T402925)', diff saved to https://phabricator.wikimedia.org/P82752 and previous config saved to /var/cache/conftool/dbconfig/20250908-184454-ladsgroup.json [18:44:58] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [18:45:10] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [18:45:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T402925)', diff saved to https://phabricator.wikimedia.org/P82753 and previous config saved to /var/cache/conftool/dbconfig/20250908-184517-ladsgroup.json [18:46:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T402925)', diff saved to https://phabricator.wikimedia.org/P82754 and previous config saved to /var/cache/conftool/dbconfig/20250908-184655-ladsgroup.json [18:47:12] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: 
Maintenance [18:47:15] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868#11159327 (10Dzahn) An older version of L3 was signed back in 2017. NDA can be assumed as staff member. group approver and team manager already approved. existing she... [18:47:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82755 and previous config saved to /var/cache/conftool/dbconfig/20250908-184718-ladsgroup.json [18:47:29] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11159330 (10ssingh) We merged the NOOP change that implements the new YAML-based `pdns-recursor` config. We have not enabled it anywhere yet because we don't have a host that ha... [18:48:30] (03PS1) 10RLazarus: all charts: Update to mesh.configuration 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186028 (https://phabricator.wikimedia.org/T403101) [18:50:40] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11159335 (10ssingh) One of the related issues here is figuring out what `pdns-recursor` settings are actually applied, since the vary across prod DNS, Wikimedia DNS, and WMCS re... [18:57:00] vriley@cumin1003 reimage (PID 1667999) is awaiting input [19:00:41] !log hashar@deploy1003 Started deploy [integration/docroot@f89c693]: build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [19:00:54] !log hashar@deploy1003 Finished deploy [integration/docroot@f89c693]: build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 (duration: 00m 13s) [19:02:13] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11159409 (10Reedy) Isn't this handled by ITS these days? 
[19:03:20] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm [19:03:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11159414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... [19:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:15:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T402925)', diff saved to https://phabricator.wikimedia.org/P82756 and previous config saved to /var/cache/conftool/dbconfig/20250908-191501-ladsgroup.json [19:15:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [19:16:45] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11159496 (10Aklapper) Likely... Are there discoverable docs which describe how anyone could find out somehow? [19:16:50] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11159499 (10Dzahn) While there are still some special cases that forward mail to donate@ in files controlled by SRE these should all be about wikipedia.org (fo... 
[19:17:20] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11159501 (10Dzahn) >>! In T403986#11159496, @Aklapper wrote: > Likely... Are there discoverable docs which describe how anyone could find out somehow? Not sin... [19:19:57] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1186023/6879/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1186023 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:20:12] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11159506 (10ssingh) >>! In T403986#11159409, @Reedy wrote: > Isn't this handled by ITS these days? [Adding @jhathaway] We do handle some aliases (mostly the... [19:20:26] mutante: wow timing on that task. a few seconds apart. [19:21:05] sukhe: :)) yes, the ones that are left with SRE should be wikipedia.org stuff, not wikimedia.org [19:21:21] I know because it took quite some effort to move them. [19:21:39] yep thanks for confirming [19:21:47] (03PS1) 10Andrew Bogott: cloudcephmon1004: switch partman to manual setup [puppet] - 10https://gerrit.wikimedia.org/r/1186035 [19:21:56] with exim the check used to be "exim4 -bt" on any MX
[19:22:04] I don't know what replaces that in postfix [19:22:13] (03CR) 10Bking: dse-k8s: Introduce opensearch-operator namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184568 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:22:15] I have no idea tbh -- I just grep the hiera [19:22:38] yea, same. but that was a check to definitely see where it routes an address [19:22:54] i think I asked before and there was no simple equivalent command [19:23:14] 06SRE, 10LDAP-Access-Requests: Grant Access to Wmf LDAP group for FRomeo (WMF) - https://phabricator.wikimedia.org/T403960#11159513 (10Aklapper) 05Open→03Invalid Hi, please see https://phabricator.wikimedia.org/project/profile/1564/ - thanks! [19:23:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82757 and previous config saved to /var/cache/conftool/dbconfig/20250908-192347-ladsgroup.json [19:23:52] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [19:24:19] (03CR) 10Andrew Bogott: [C:03+2] cloudcephmon1004: switch partman to manual setup [puppet] - 10https://gerrit.wikimedia.org/r/1186035 (owner: 10Andrew Bogott) [19:25:24] (03PS3) 10DLynch: Update VE core submodule to master (a5bd08c8b) [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185982 (https://phabricator.wikimedia.org/T302413) [19:25:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185982 (https://phabricator.wikimedia.org/T302413) (owner: 10DLynch) [19:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:26:06] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [19:28:19] (03PS1) 10Scott French: P:trafficserver::backend: add mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1184914 (https://phabricator.wikimedia.org/T403655) [19:28:21] (03PS2) 10Scott French: hieradata: add mw-next-routing to ATS tslua plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1184915 (https://phabricator.wikimedia.org/T403655) [19:29:59] (03PS2) 10Umherirrender: build: Updating mediawiki/mediawiki-codesniffer to 48.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) [19:30:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P82758 and previous config saved to /var/cache/conftool/dbconfig/20250908-193009-ladsgroup.json [19:31:35] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11159569 (10CDobbins) @ssingh: that's right. I thought about this a bit over the weekend, and I think the easiest approach is going to be doing a clean install in a VM, grabbing... 
[19:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:34:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:37:05] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [19:37:25] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [19:38:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P82759 and previous config saved to /var/cache/conftool/dbconfig/20250908-193855-ladsgroup.json [19:39:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:41:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 1.995 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:43:47] (03PS7) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [19:43:56] (03CR) 10CI reject: [V:04-1] opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:44:32] (03Abandoned) 10DLynch: Move ve.track.js into a separate module [VisualEditor/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185970 (https://phabricator.wikimedia.org/T403745) (owner: 10DLynch) [19:45:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P82760 and previous config 
saved to /var/cache/conftool/dbconfig/20250908-194516-ladsgroup.json [19:46:27] (03PS2) 10RLazarus: cleanup: Remove Envoy 1.26.8 overrides now that it's the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) [19:47:48] (03CR) 10RLazarus: "This doesn't touch mathoid, which is still pinned to envoy-future:1.26.8-3 but I'm about to bump it separately to 1.29." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [19:49:24] (03PS3) 10RLazarus: cleanup: Remove Envoy 1.26.8 overrides now that it's the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) [19:49:44] (03PS1) 10Jdlrobson: Temporarily use production for summary endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186044 (https://phabricator.wikimedia.org/T400694) [19:51:33] (03PS5) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 [19:51:58] (03PS1) 10Dzahn: zuul::executor: add zuul config file [puppet] - 10https://gerrit.wikimedia.org/r/1186045 (https://phabricator.wikimedia.org/T403847) [19:54:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P82761 and previous config saved to /var/cache/conftool/dbconfig/20250908-195402-ladsgroup.json [19:55:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "wmflib::dir::mkdir_p('/etc/zuul/ssh') already ensures the path exists" [puppet] - 10https://gerrit.wikimedia.org/r/1186045 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [19:56:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:56:49] (03CR) 10RLazarus: "No diffs as expected, after I pulled out the {api,rest}-gateway changes 
that were in by mistake. Thanks helm-lint! https://integration.wik" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [19:57:32] (03PS8) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [19:57:41] (03CR) 10CI reject: [V:04-1] opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:59:49] (03PS2) 10RLazarus: envoy-future: Update to v1.29.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185232 (https://phabricator.wikimedia.org/T403663) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T2000). [20:00:05] RoanKattouw, kemayo, Superpes, and maryum: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] o/ [20:00:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T402925)', diff saved to https://phabricator.wikimedia.org/P82762 and previous config saved to /var/cache/conftool/dbconfig/20250908-200024-ladsgroup.json [20:00:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [20:00:30] I can do my own deploy. 
[20:00:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance [20:00:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T402925)', diff saved to https://phabricator.wikimedia.org/P82763 and previous config saved to /var/cache/conftool/dbconfig/20250908-200047-ladsgroup.json [20:01:28] (03PS9) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [20:01:36] (03CR) 10CI reject: [V:04-1] opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:01:52] I'll get my own stuff started, since nobody else is here yet. [20:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185991 (https://phabricator.wikimedia.org/T400849) (owner: 10Esanders) [20:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185982 (https://phabricator.wikimedia.org/T302413) (owner: 10DLynch) [20:02:20] I'm here! [20:02:25] I can also do my own deploy [20:02:33] Roan might be coming later in the window [20:02:53] The curse of the deployments list sort of being a queue and also sort of not. 
[20:03:10] (03Merged) 10jenkins-bot: Enable DT thanks at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185991 (https://phabricator.wikimedia.org/T400849) (owner: 10Esanders) [20:04:23] (03Merged) 10jenkins-bot: Update VE core submodule to master (a5bd08c8b) [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185982 (https://phabricator.wikimedia.org/T302413) (owner: 10DLynch) [20:04:40] Ooh, it's a quick-merging day. [20:04:45] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1185991|Enable DT thanks at mediawikiwiki (T400849)]], [[gerrit:1185982|Update VE core submodule to master (a5bd08c8b) (T302413 T391521 T397145 T401890 T402392 T397518 T402717 T403741 T403745)]] [20:05:12] T400849: Enable "Thanks" from talk pages as an opt-in beta feature - https://phabricator.wikimedia.org/T400849 [20:05:13] T302413: Visual diff of templates inside the table shows all descriptions at the bottom, in backwards order - https://phabricator.wikimedia.org/T302413 [20:05:13] T391521: VE: Deleting sub-ref attached to main content does not orphan other sub-refs in same article - https://phabricator.wikimedia.org/T391521 [20:05:14] T397145: Move footnote numbering information out of singleton document cache - https://phabricator.wikimedia.org/T397145 [20:05:14] T401890: Long link labels don't show ellipsis in link context on mobile - https://phabricator.wikimedia.org/T401890 [20:05:15] T402392: Bring basic reference functionality into VisualEditor standalone - https://phabricator.wikimedia.org/T402392 [20:05:15] T397518: VisualDiff should use vertical ellipsis consistenly - https://phabricator.wikimedia.org/T397518 [20:05:16] T402717: ClipboardHandler preserves existing ImportedDataAnnotation when pasting over previously-pasted content - https://phabricator.wikimedia.org/T402717 [20:05:16] T403741: Move annotation-removal logic out of AnnotationAction into SurfaceFragment - 
https://phabricator.wikimedia.org/T403741 [20:05:16] T403745: ve.track module isn't loaded when launching 2017 editor on a page with discussiontools enabled - https://phabricator.wikimedia.org/T403745 [20:06:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 1.863 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:07:22] I'll be there in 20-30 minutes [20:09:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82764 and previous config saved to /var/cache/conftool/dbconfig/20250908-200910-ladsgroup.json [20:09:15] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [20:09:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [20:09:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T402925)', diff saved to https://phabricator.wikimedia.org/P82765 and previous config saved to /var/cache/conftool/dbconfig/20250908-200934-ladsgroup.json [20:11:02] !log kemayo@deploy1003 kemayo, esanders: Backport for [[gerrit:1185991|Enable DT thanks at mediawikiwiki (T400849)]], [[gerrit:1185982|Update VE core submodule to master (a5bd08c8b) (T302413 T391521 T397145 T401890 T402392 T397518 T402717 T403741 T403745)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:11:21] T400849: Enable "Thanks" from talk pages as an opt-in beta feature - https://phabricator.wikimedia.org/T400849 [20:11:21] T302413: Visual diff of templates inside the table shows all descriptions at the bottom, in backwards order - https://phabricator.wikimedia.org/T302413 [20:11:22] T391521: VE: Deleting sub-ref attached to main content does not orphan other sub-refs in same article - https://phabricator.wikimedia.org/T391521 [20:11:22] T397145: Move footnote numbering information out of singleton document cache - https://phabricator.wikimedia.org/T397145 [20:11:23] T401890: Long link labels don't show ellipsis in link context on mobile - https://phabricator.wikimedia.org/T401890 [20:11:23] T402392: Bring basic reference functionality into VisualEditor standalone - https://phabricator.wikimedia.org/T402392 [20:11:23] T397518: VisualDiff should use vertical ellipsis consistenly - https://phabricator.wikimedia.org/T397518 [20:11:24] T402717: ClipboardHandler preserves existing ImportedDataAnnotation when pasting over previously-pasted content - https://phabricator.wikimedia.org/T402717 [20:11:24] T403741: Move annotation-removal logic out of AnnotationAction into SurfaceFragment - https://phabricator.wikimedia.org/T403741 [20:11:24] T403745: ve.track module isn't loaded when launching 2017 editor on a page with discussiontools enabled - https://phabricator.wikimedia.org/T403745 [20:11:49] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy-future: Update to v1.29.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185232 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [20:12:26] !log kemayo@deploy1003 kemayo, esanders: Continuing with sync [20:13:32] (03PS10) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [20:13:40] (03CR) 10CI reject: [V:04-1] opensearch-operator: create namespace and helmfile 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:15:32] (03PS11) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [20:16:07] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11159914 (10SLong-WMF) Hey folks! I was made aware of this on Friday and can help keep things organized and reported here (since I manage the QS team). I'... [20:17:50] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185991|Enable DT thanks at mediawikiwiki (T400849)]], [[gerrit:1185982|Update VE core submodule to master (a5bd08c8b) (T302413 T391521 T397145 T401890 T402392 T397518 T402717 T403741 T403745)]] (duration: 13m 05s) [20:18:01] great, going to go ahead and get mine started [20:18:06] T400849: Enable "Thanks" from talk pages as an opt-in beta feature - https://phabricator.wikimedia.org/T400849 [20:18:06] T302413: Visual diff of templates inside the table shows all descriptions at the bottom, in backwards order - https://phabricator.wikimedia.org/T302413 [20:18:07] T391521: VE: Deleting sub-ref attached to main content does not orphan other sub-refs in same article - https://phabricator.wikimedia.org/T391521 [20:18:07] T397145: Move footnote numbering information out of singleton document cache - https://phabricator.wikimedia.org/T397145 [20:18:08] T401890: Long link labels don't show ellipsis in link context on mobile - https://phabricator.wikimedia.org/T401890 [20:18:08] T402392: Bring basic reference functionality into VisualEditor standalone - https://phabricator.wikimedia.org/T402392 [20:18:09] T397518: VisualDiff should use vertical ellipsis consistenly - https://phabricator.wikimedia.org/T397518 [20:18:09] T402717: ClipboardHandler preserves existing 
ImportedDataAnnotation when pasting over previously-pasted content - https://phabricator.wikimedia.org/T402717 [20:18:10] T403741: Move annotation-removal logic out of AnnotationAction into SurfaceFragment - https://phabricator.wikimedia.org/T403741 [20:18:10] T403745: ve.track module isn't loaded when launching 2017 editor on a page with discussiontools enabled - https://phabricator.wikimedia.org/T403745 [20:18:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [20:18:24] maryum: Enjoy! [20:18:33] thanks! [20:18:58] (03PS1) 10Andrew Bogott: remove hiera host file for server that doesn't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/1186054 [20:18:58] (03PS1) 10Andrew Bogott: cloudcephmon1004: return to standard preseed [puppet] - 10https://gerrit.wikimedia.org/r/1186055 [20:19:33] (03PS12) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [20:19:34] (03Merged) 10jenkins-bot: OATHAuth: Enable 2FA opt-in for 10% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles) [20:19:51] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1174049|OATHAuth: Enable 2FA opt-in for 10% of users (T400579)]] [20:19:55] T400579: Add ability to make 2FA available to N% of users - https://phabricator.wikimedia.org/T400579 [20:21:41] (03CR) 10Andrew Bogott: [C:03+2] remove hiera host file for server that doesn't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/1186054 (owner: 10Andrew Bogott) [20:22:24] (03CR) 10Andrew Bogott: [C:03+2] cloudcephmon1004: return to standard preseed [puppet] - 10https://gerrit.wikimedia.org/r/1186055 (owner: 10Andrew Bogott) [20:23:01] !log andrew@cumin2002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [20:23:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [20:24:20] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [20:24:47] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [20:26:12] !log mstyles@deploy1003 mstyles: Backport for [[gerrit:1174049|OATHAuth: Enable 2FA opt-in for 10% of users (T400579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:26:16] T400579: Add ability to make 2FA available to N% of users - https://phabricator.wikimedia.org/T400579 [20:27:59] !log mstyles@deploy1003 mstyles: Continuing with sync [20:28:09] (03PS13) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [20:29:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T402925)', diff saved to https://phabricator.wikimedia.org/P82766 and previous config saved to /var/cache/conftool/dbconfig/20250908-203317-ladsgroup.json [20:33:19] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174049|OATHAuth: Enable 2FA opt-in for 10% of users (T400579)]] (duration: 13m 28s) [20:33:22] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [20:33:25] T400579: Add ability to make 2FA available to N% of users - https://phabricator.wikimedia.org/T400579 [20:33:45] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [20:34:05] !log 
andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [20:34:16] maryum I'm here! Sorry for being late :) [20:34:24] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11160014 (10SLong-WMF) [20:34:27] no worries I just finished deploying my change [20:34:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:36:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:41:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 7.635 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:43:28] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11160029 (10jhathaway) On the mx-in servers you can obtain routing information via `sendmail -bv`, however it is a bit more annoying to work with compared to `... [20:43:44] I'm here now [20:43:57] Superpes: Are you able to deploy your own change or should I do it for you? 
[20:44:13] (03CR) 10Catrope: [C:03+2] Fix display of Codex message icons II [skins/Vector] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185159 (https://phabricator.wikimedia.org/T401457) (owner: 10Catrope) [20:44:53] RoanKattouw I can't deploy :) [20:45:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T402925)', diff saved to https://phabricator.wikimedia.org/P82767 and previous config saved to /var/cache/conftool/dbconfig/20250908-204517-ladsgroup.json [20:45:21] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [20:46:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185105 (https://phabricator.wikimedia.org/T402083) (owner: 10Superpes15) [20:48:05] OK let's start with the lbwiki change [20:48:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P82768 and previous config saved to /var/cache/conftool/dbconfig/20250908-204824-ladsgroup.json [20:48:53] (03Merged) 10jenkins-bot: [lbwiki] Change to 'uca-lb-u-kn' category collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185105 (https://phabricator.wikimedia.org/T402083) (owner: 10Superpes15) [20:49:08] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1185105|[lbwiki] Change to 'uca-lb-u-kn' category collation (T402083)]] [20:49:11] T402083: Set $wgCategoryCollation for lbwiki to uca-lb-u-kn - https://phabricator.wikimedia.org/T402083 [20:51:15] 06SRE, 06FR-donorrelations, 
06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11160064 (10DSeyfert_WMF) Thank you everyone - the history of this address is why I wanted to confirm, thank you for your help! We'd greatly appreciate if ther... [20:54:23] (03Merged) 10jenkins-bot: Fix display of Codex message icons II [skins/Vector] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185159 (https://phabricator.wikimedia.org/T401457) (owner: 10Catrope) [20:55:21] !log catrope@deploy1003 catrope, superpes: Backport for [[gerrit:1185105|[lbwiki] Change to 'uca-lb-u-kn' category collation (T402083)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:55:25] T402083: Set $wgCategoryCollation for lbwiki to uca-lb-u-kn - https://phabricator.wikimedia.org/T402083 [20:55:44] I don't think this can easily be tested, we'll just have to roll it out and then run the script [20:55:55] !log catrope@deploy1003 catrope, superpes: Continuing with sync [20:55:59] Yep, I just tested the api, it works! [20:55:59] That sounds like testing to me :) [20:58:38] FTR https://lb.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general|namespaces&format=json with WikimediaDebug enabled resulted in "categorycollation":"uca-lb-u-kn" [21:00:04] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T2100). 
[21:00:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P82769 and previous config saved to /var/cache/conftool/dbconfig/20250908-210024-ladsgroup.json [21:01:19] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185105|[lbwiki] Change to 'uca-lb-u-kn' category collation (T402083)]] (duration: 12m 11s) [21:01:22] T402083: Set $wgCategoryCollation for lbwiki to uca-lb-u-kn - https://phabricator.wikimedia.org/T402083 [21:02:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:03:32] I'm getting ready to deploy one security patch [21:03:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:03:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P82770 and previous config saved to /var/cache/conftool/dbconfig/20250908-210332-ladsgroup.json [21:04:00] Collation maintenance script is running [21:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:04:20] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1185159|Fix display of Codex message icons II (T401457)]] [21:04:23] T401457: Message: Fix height of CSS-only message icon - https://phabricator.wikimedia.org/T401457 [21:04:45] It should be quick [21:05:10] okay cool [21:10:00] !log catrope@deploy1003 catrope: Backport for [[gerrit:1185159|Fix display of Codex message
icons II (T401457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:10:04] T401457: Message: Fix height of CSS-only message icon - https://phabricator.wikimedia.org/T401457 [21:10:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:11:11] 06SRE: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008 (10Reedy) 03NEW [21:11:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:12:46] 06SRE: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010 (10Reedy) 03NEW [21:13:43] 06SRE: docker-registry will show different last updated TZ as you refresh the page... - https://phabricator.wikimedia.org/T404011 (10Reedy) 03NEW [21:15:17] !log Ran updateCollation.php on lbwiki for T402083 [21:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:21] T402083: Set $wgCategoryCollation for lbwiki to uca-lb-u-kn - https://phabricator.wikimedia.org/T402083 [21:15:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P82771 and previous config saved to /var/cache/conftool/dbconfig/20250908-211532-ladsgroup.json [21:15:36] Superpes: The updateCollation script is done running on lbwiki, all the categories should be re-collated now [21:15:45] Thanks :) [21:15:50] !log catrope@deploy1003 catrope: Continuing with sync [21:16:49] 07Puppet, 06SRE, 06Release-Engineering-Team: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11160187 (10Reedy) [21:17:37] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new 
mobile URLs - https://phabricator.wikimedia.org/T403638#11160189 (10SLong-WMF) From Native Apps: They're researching the impact this may have on Native Apps currently. @ABorbaWMF will be producing a test plan i... [21:17:42] 06SRE, 06Release-Engineering-Team: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11160190 (10Reedy) [21:17:46] 06SRE, 06Release-Engineering-Team: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11160191 (10Reedy) [21:17:50] 06SRE, 06Release-Engineering-Team: docker-registry will show different last updated TZ as you refresh the page... - https://phabricator.wikimedia.org/T404011#11160192 (10Reedy) [21:18:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T402925)', diff saved to https://phabricator.wikimedia.org/P82772 and previous config saved to /var/cache/conftool/dbconfig/20250908-211840-ladsgroup.json [21:18:44] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:18:55] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance [21:19:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T402925)', diff saved to https://phabricator.wikimedia.org/P82773 and previous config saved to /var/cache/conftool/dbconfig/20250908-211902-ladsgroup.json [21:20:11] is everything good for me to deploy? 
[21:20:32] !log bking@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for wdqs2025.codfw.wmnet: Renew puppet certificate - bking@cumin1002 [21:21:06] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185159|Fix display of Codex message icons II (T401457)]] (duration: 16m 46s) [21:21:10] T401457: Message: Fix height of CSS-only message icon - https://phabricator.wikimedia.org/T401457 [21:21:33] maryum: Hold on, there's one more patch for Superpes [21:22:14] Yep if you have time! Otherwise I can reschedule it :) [21:23:00] Yes let's reschedule actually if you don't mind Superpes [21:23:06] So maryum go ahead [21:23:12] awesome, thanks [21:23:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:23:35] No issue for me! Thanks for your time :) [21:24:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:25:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:26:16] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:27:06] scap is running [21:28:04] (03CR) 10Scott French: [C:03+1] cleanup: Remove Envoy 1.26.8 overrides now that it's the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [21:28:51] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11160276 (10Dzahn) > sendmail -bv aha, thanks for adding that, @jhathaway > discoverable way we can check this Not really, because that is in a non-public... 
[21:29:03] andrew@cumin2002 reimage (PID 959327) is awaiting input [21:29:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:30:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T402925)', diff saved to https://phabricator.wikimedia.org/P82774 and previous config saved to /var/cache/conftool/dbconfig/20250908-213040-ladsgroup.json [21:30:44] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:30:56] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [21:31:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T402925)', diff saved to https://phabricator.wikimedia.org/P82775 and previous config saved to /var/cache/conftool/dbconfig/20250908-213103-ladsgroup.json [21:31:32] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11160295 (10Dzahn) Another way to look for history is to browse the title of the subtasks of T122144. [21:32:52] (03CR) 10Awight: "I53ef996b1c72f is tests only, so no reason to deploy I guess." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1185992 (owner: 10PipelineBot) [21:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:34:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.741 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:45] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:38:16] !log Deployed security fix for T403408 [21:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:24] scap finished, enjoy everyone [21:38:46] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:42:35] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:43:34] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:43:57] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[21:44:13] (03PS14) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [21:45:54] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:46:57] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:48:47] (03PS15) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [21:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:49:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T402925)', diff saved to https://phabricator.wikimedia.org/P82776 and previous config saved to /var/cache/conftool/dbconfig/20250908-214941-ladsgroup.json [21:49:46] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:49:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:51:06] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:51:36] (03PS16) 10Bking: opensearch-operator: create namespace and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184572 (https://phabricator.wikimedia.org/T397246) [21:52:08] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set 
policy FORCE_RESTART [21:56:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 1.567 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:53] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:58:53] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:01:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11160404 (10Jhancock.wm) @elukey we got a new version here. Config-G. looks like the provisioning script doesn't agree with the console redirect settings present on the server. it is odd cause i... [22:04:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P82777 and previous config saved to /var/cache/conftool/dbconfig/20250908-220449-ladsgroup.json [22:07:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T402925)', diff saved to https://phabricator.wikimedia.org/P82778 and previous config saved to /var/cache/conftool/dbconfig/20250908-220728-ladsgroup.json [22:07:32] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:07:54] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [22:09:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:19:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P82779 and previous config saved to 
/var/cache/conftool/dbconfig/20250908-221956-ladsgroup.json [22:22:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P82780 and previous config saved to /var/cache/conftool/dbconfig/20250908-222235-ladsgroup.json [22:23:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [22:35:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T402925)', diff saved to https://phabricator.wikimedia.org/P82781 and previous config saved to /var/cache/conftool/dbconfig/20250908-223504-ladsgroup.json [22:35:10] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:35:21] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance [22:35:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T402925)', diff saved to https://phabricator.wikimedia.org/P82782 and previous config saved to /var/cache/conftool/dbconfig/20250908-223528-ladsgroup.json [22:37:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P82783 and previous config saved to /var/cache/conftool/dbconfig/20250908-223742-ladsgroup.json [22:39:02] (03Abandoned) 10Ladsgroup: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1170555 (https://phabricator.wikimedia.org/T399954) (owner: 10Gerrit maintenance bot) [22:39:06] (03Abandoned) 10Ladsgroup: mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1170554 (https://phabricator.wikimedia.org/T399954) (owner: 10Gerrit maintenance bot) [22:40:14] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1186078 
(https://phabricator.wikimedia.org/T404025) [22:40:19] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1186079 (https://phabricator.wikimedia.org/T404025) [22:42:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T404025 [22:42:55] T404025: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T404025 [22:43:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set db1189 with weight 0 T404025', diff saved to https://phabricator.wikimedia.org/P82784 and previous config saved to /var/cache/conftool/dbconfig/20250908-224330-ladsgroup.json [22:47:32] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1186078 (https://phabricator.wikimedia.org/T404025) (owner: 10Gerrit maintenance bot) [22:48:51] !log Starting s3 eqiad failover from db1223 to db1189 - T404025 [22:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:55] T404025: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T404025 [22:49:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T404025', diff saved to https://phabricator.wikimedia.org/P82785 and previous config saved to /var/cache/conftool/dbconfig/20250908-224914-ladsgroup.json [22:50:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote db1189 to s3 primary and set section read-write T404025', diff saved to https://phabricator.wikimedia.org/P82786 and previous config saved to /var/cache/conftool/dbconfig/20250908-225054-ladsgroup.json [22:52:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T402925)', diff saved to https://phabricator.wikimedia.org/P82787 and previous config saved to /var/cache/conftool/dbconfig/20250908-225250-ladsgroup.json [22:52:54] T402925: Drop cl_to and 
cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:52:58] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1186079 (https://phabricator.wikimedia.org/T404025) (owner: 10Gerrit maintenance bot) [22:53:06] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [22:53:12] !log ladsgroup@dns1004 START - running authdns-update [22:53:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T402925)', diff saved to https://phabricator.wikimedia.org/P82788 and previous config saved to /var/cache/conftool/dbconfig/20250908-225313-ladsgroup.json [22:54:15] !log ladsgroup@dns1004 END - running authdns-update [22:56:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1223 T404025', diff saved to https://phabricator.wikimedia.org/P82789 and previous config saved to /var/cache/conftool/dbconfig/20250908-225603-ladsgroup.json [22:56:07] T404025: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T404025 [22:58:46] (03CR) 10RLazarus: [C:03+2] cleanup: Remove Envoy 1.26.8 overrides now that it's the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [22:59:04] (03PS2) 10Jdlrobson: Temporarily use production for summary endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186044 (https://phabricator.wikimedia.org/T400694) [22:59:17] doing /me takes the deploy conch [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250908T2300) [23:01:58] (03Merged) 10jenkins-bot: cleanup: Remove Envoy 1.26.8 overrides now that it's the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185995 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [23:02:10] !log ladsgroup@cumin1002 START 
- Cookbook sre.mysql.upgrade for db1223.eqiad.wmnet [23:02:33] (03PS6) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 [23:02:50] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.depool db1223 - Upgrading db1223.eqiad.wmnet [23:02:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1223 - Upgrading db1223.eqiad.wmnet [23:03:47] (03PS7) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 [23:04:08] (03CR) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [23:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186044 (https://phabricator.wikimedia.org/T400694) (owner: 10Jdlrobson) [23:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [23:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [23:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186044 (https://phabricator.wikimedia.org/T400694) (owner: 10Jdlrobson) [23:05:03] (03Merged) 10jenkins-bot: Temporarily use production for summary endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186044 (https://phabricator.wikimedia.org/T400694) (owner: 10Jdlrobson) [23:05:26] (03Merged) 10jenkins-bot: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [23:05:41] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1182944|Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled]], [[gerrit:1186044|Temporarily use production for summary endpoint (T400694)]] [23:05:45] T400694: [CI] selenium-daily-beta-Popups tests failing since July 17 - https://phabricator.wikimedia.org/T400694 [23:08:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1223.eqiad.wmnet [23:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:10:21] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1223 gradually with 4 steps - Maint over [23:11:39] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1182944|Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled]], [[gerrit:1186044|Temporarily use production for summary endpoint (T400694)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:11:42] T400694: [CI] selenium-daily-beta-Popups tests failing since July 17 - https://phabricator.wikimedia.org/T400694 [23:14:29] (testing) [23:14:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:16:17] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [23:18:46] andrew@cumin2002 reimage (PID 1012398) is awaiting input [23:21:47] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182944|Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled]], [[gerrit:1186044|Temporarily use production for summary endpoint (T400694)]] (duration: 16m 06s) [23:21:51] T400694: [CI] selenium-daily-beta-Popups tests failing since July 17 - https://phabricator.wikimedia.org/T400694 [23:23:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1186089 (https://phabricator.wikimedia.org/T404027) [23:23:58] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1186090 (https://phabricator.wikimedia.org/T404027) [23:24:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:25:43] (done) [23:26:35] borrowing mw-debug for a Science Experiment [23:28:02] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [23:29:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T402925)', diff saved to https://phabricator.wikimedia.org/P82791 and previous config saved to /var/cache/conftool/dbconfig/20250908-232912-ladsgroup.json [23:29:16] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:29:24] (03PS1) 10Ladsgroup: db1223: Migrate to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1186093 (https://phabricator.wikimedia.org/T399548) [23:29:50] !log ladsgroup@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1223 gradually with 4 steps - Maint over [23:29:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T402925)', diff saved to https://phabricator.wikimedia.org/P82793 and previous config saved to /var/cache/conftool/dbconfig/20250908-232958-ladsgroup.json [23:30:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Upgrade db1223 to MariaDB 10.11 (T399548)', diff saved to https://phabricator.wikimedia.org/P82794 and previous config saved to /var/cache/conftool/dbconfig/20250908-233042-ladsgroup.json [23:30:46] T399548: Migrate s3 to MariaDB 10.11 - https://phabricator.wikimedia.org/T399548 [23:30:52] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [23:31:12] !log helmfile -e eqiad -i apply --set mesh.image_name=envoy-future --set mesh.image_version=1.29.12-1 --context=5 # T403663 [23:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:15] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [23:31:24] (03CR) 10Ladsgroup: [C:03+2] db1223: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1186093 (https://phabricator.wikimedia.org/T399548) (owner: 10Ladsgroup) [23:31:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:31:45] oops that should have said "mw-debug" :) updated the SAL [23:33:00] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1223.eqiad.wmnet with reason: Upgrade to 10.11 [23:36:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.756 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:38:08] (03PS1) 
10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1186097 [23:38:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1186097 (owner: 10TrainBranchBot) [23:44:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P82795 and previous config saved to /var/cache/conftool/dbconfig/20250908-234419-ladsgroup.json [23:45:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P82796 and previous config saved to /var/cache/conftool/dbconfig/20250908-234506-ladsgroup.json [23:46:21] (03CR) 10Dr0ptp4kt: "Checking for @ltoscano@wikimedia.org take on this and its sibling patch." [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [23:50:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1186097 (owner: 10TrainBranchBot) [23:51:17] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:53:39] (03PS1) 10RLazarus: mathoid: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186099 (https://phabricator.wikimedia.org/T403663) [23:53:43] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1223 gradually with 4 steps - Maint over [23:56:17] RESOLVED: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:59:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P82798 and previous config saved to /var/cache/conftool/dbconfig/20250908-235927-ladsgroup.json