[00:01:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:02:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.355 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:04:40] RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 1.21% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:05:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:09:13] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:10:17] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:17:05] That'd explain it [00:17:42] PROBLEM - HTTPS on gerrit1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/330/ [00:18:34] !log restart apache2 on gerrit1003 [00:18:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:42] RECOVERY - HTTPS on gerrit1003 is OK: SSL OK - Certificate gerrit.wikimedia.org valid until 2025-12-28 16:37:29 +0000 (expires in 66 days) https://phabricator.wikimedia.org/project/view/330/ [00:19:13] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:19:15] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:19:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:20:26] still seems down to me [00:20:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:22:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:24:02] PROBLEM - HTTPS on gerrit1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/330/ 
[00:24:13] RESOLVED: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:25:04] RECOVERY - HTTPS on gerrit1003 is OK: SSL OK - Certificate gerrit.wikimedia.org valid until 2025-12-28 16:37:29 +0000 (expires in 66 days) https://phabricator.wikimedia.org/project/view/330/ [00:25:39] ok, up now [00:27:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:27:34] back down or slow I guess [00:27:58] host stats look like nothing much is happening, CPU etc. is fine [00:28:14] apache stats just disappear every time there's a problem so you don't really get any insight that way [00:29:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:31:06] PROBLEM - HTTPS on gerrit1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/330/ [00:31:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:31:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:31:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:31:16] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:31:16] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:31:38] RECOVERY - HTTPS on gerrit1003 is OK: SSL OK - Certificate gerrit.wikimedia.org valid until 2025-12-28 16:37:29 +0000 (expires in 66 days) https://phabricator.wikimedia.org/project/view/330/ [00:34:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:38:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:39:13] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:41:20] RECOVERY - check if authdns-update was run after a change 
was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:41:20] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:41:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:41:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:41:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:43:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:46:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [00:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:59:18] (PS1)
Tim Starling: recentchanges: Restore table qualifiers in change tag field expressions [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198185 (https://phabricator.wikimedia.org/T408040) [01:19:45] (PS1) Cwhite: hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198186 (https://phabricator.wikimedia.org/T406774) [01:26:15] (CR) Cwhite: [C:+2] hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198186 (https://phabricator.wikimedia.org/T406774) (owner: Cwhite) [01:34:13] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:42] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone_es (exit_code=99) of es2034.codfw.wmnet onto es2057.codfw.wmnet [01:37:14] (CR) Andrea Denisse: "Hi folks, I'm abandoning this as I've set-up these alerts from Grafana.
https://grafana-rw.wikimedia.org/alerting/list?search=namespace%3A" [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse) [01:37:32] (Abandoned) Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse) [01:44:13] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:15] FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:48:58] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11301063 (VRiley-WMF) I'm having issues bringing ms-be1090 back up.
Will continue to work on this [01:49:15] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11301064 (VRiley-WMF) [01:49:58] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198178 (https://phabricator.wikimedia.org/T403798) (owner: Tim Starling) [01:49:59] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198185 (https://phabricator.wikimedia.org/T408040) (owner: Tim Starling) [01:53:54] (Merged) jenkins-bot: recentchanges: QueryRateEstimator improvements [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198178 (https://phabricator.wikimedia.org/T403798) (owner: Tim Starling) [01:54:13] (PS1) Cwhite: hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198188 (https://phabricator.wikimedia.org/T406774) [01:55:06] (CR) Cwhite: [C:+2] hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198188 (https://phabricator.wikimedia.org/T406774) (owner: Cwhite) [02:04:54] (Merged) jenkins-bot: recentchanges: Restore table qualifiers in change tag field expressions [core] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198185 (https://phabricator.wikimedia.org/T408040) (owner: Tim Starling) [02:05:51] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1198178|recentchanges: QueryRateEstimator improvements (T403798)]], [[gerrit:1198185|recentchanges: Restore table qualifiers in change tag field expressions (T408040)]] [02:05:58] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798 [02:05:58] T408040: Wikimedia\Rdbms\DBQueryError: Error 1052:
Column 'ct_tag_id' in WHERE is ambiguous - https://phabricator.wikimedia.org/T408040 [02:11:11] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1198178|recentchanges: QueryRateEstimator improvements (T403798)]], [[gerrit:1198185|recentchanges: Restore table qualifiers in change tag field expressions (T408040)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [02:11:17] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798 [02:11:18] T408040: Wikimedia\Rdbms\DBQueryError: Error 1052: Column 'ct_tag_id' in WHERE is ambiguous - https://phabricator.wikimedia.org/T408040 [02:17:15] !log tstarling@deploy2002 tstarling: Continuing with sync [02:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:21:22] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198178|recentchanges: QueryRateEstimator improvements (T403798)]], [[gerrit:1198185|recentchanges: Restore table qualifiers in change tag field expressions (T408040)]] (duration: 15m 30s) [02:21:28] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798 [02:21:28] T408040: Wikimedia\Rdbms\DBQueryError: Error 1052: Column 'ct_tag_id' in WHERE is ambiguous - https://phabricator.wikimedia.org/T408040 [02:24:01] (PS1) Cwhite: hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198189 (https://phabricator.wikimedia.org/T406774) [02:25:07] (CR) Cwhite: [C:+2] hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198189 (https://phabricator.wikimedia.org/T406774) (owner: Cwhite)
[02:41:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:46:20] (PS1) Cwhite: hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198190 (https://phabricator.wikimedia.org/T406774) [02:47:30] (PS1) BCornwall: wikimedia.org: Add Figma domain verification [dns] - https://gerrit.wikimedia.org/r/1198191 (https://phabricator.wikimedia.org/T408003) [02:48:03] (CR) Cwhite: [C:+2] hiera: block more china unicom and telecom abusers [puppet] - https://gerrit.wikimedia.org/r/1198190 (https://phabricator.wikimedia.org/T406774) (owner: Cwhite) [03:04:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:09:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.389 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:12:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:15:44] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 7.612 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:21:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:26:38] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:35:16] (PS2) Clare Ming: Add config for xLab MW Module experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) [03:43:53] (CR) Clare Ming: Add config for xLab MW Module experiment (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: Clare Ming) [03:44:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: Clare Ming) [03:44:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: Clare Ming) [03:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:49:49] ops-eqiad, DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408055 (phaultfinder) NEW [03:53:51] FIRING: CoreRouterInterfaceDown: Core
router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:58:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:59:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:10:32] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:14:13] FIRING: [4x] 
JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:24:13] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:24:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:31:03] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs1011.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [04:31:08] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [04:33:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:42:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs1011.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [04:42:20] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [04:52:46] PROBLEM - Wikitech-static main page has content on 
wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [04:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:54:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.950 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:55:57] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling both afterwards [04:55:58] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling both afterwards [04:56:01] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [04:57:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [04:58:26] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling both afterwards [04:58:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:06:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes 
(http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:09:13] FIRING: [6x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:09:36] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [05:11:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:14:13] FIRING: [6x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:50] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:34:54] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:37:55] SRE, Hiddenparma: Distinguish request classes based on user-agent declaration - https://phabricator.wikimedia.org/T408060 (Joe) NEW [05:39:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 2a02:ec80:700:fe0b::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:39:50] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:39:52] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:42:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:44:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and 2a02:ec80:700:fe0b::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:44:15] FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:45:24] SRE, Hiddenparma: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545#11301321 (Joe)
[05:45:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.100 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:45:47] SRE, Hiddenparma: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545#11301323 (Joe) [05:45:59] SRE, Hiddenparma: Distinguish request classes based on user-agent declaration - https://phabricator.wikimedia.org/T408060#11301326 (Joe) [05:46:21] SRE, Hiddenparma: Distinguish request classes based on user-agent declaration - https://phabricator.wikimedia.org/T408060#11301328 (Joe) p:Triage→High [05:48:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2010.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:48:20] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [05:49:05] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:49:06] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:51:05] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:55:34] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062 (Joe) NEW [05:55:45] SRE-swift-storage,
06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11301353 (10Joe) p:05Triage→03High [05:59:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2010.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:59:29] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0600) [06:00:06] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0600). [06:03:14] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:31:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:31:57] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:32:02] T406920: 
deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [06:32:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:33:33] (03PS1) 10Slyngshede: data.yaml offboarding amire80 [puppet] - 10https://gerrit.wikimedia.org/r/1198201 [06:37:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [06:38:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [06:39:55] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [06:40:41] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [06:43:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:43:13] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [06:43:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:43:33] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:43:41] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:43:53] (03CR) 10Brouberol: WIP: deploy a test OpenSearch cluster in 
opensearch-ipoid-test ns (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [06:46:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:51:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:54:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2014.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:54:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2015.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:54:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [06:54:35] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [06:55:07] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:55:19] !log ryankemper@cumin2002 START 
- Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2022.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:58:01] Is anyone able to git fetch/clone from gerrit atm? [07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0700). [07:00:05] cjming and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:02] brouberol: works for me [07:02:11] o/ [07:02:37] I can deploy [07:03:20] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064 (10Jelto) 03NEW [07:03:37] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11301414 (10Jelto) [07:03:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11301416 (10Jelto) [07:04:33] cjming: skipping your patch, I see that your patch is also scheduled for later today, I suppose you did not intend to deploy it during this window [07:05:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:05:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2007.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:06:04] T406920: deepcategory search fails to show all expected results - 
https://phabricator.wikimedia.org/T406920 [07:06:18] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2022.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:06:46] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 9.456 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:07:28] (03PS1) 10Ryan Kemper: wdqs: don't nuke data_loaded file for categ xfer [cookbooks] - 10https://gerrit.wikimedia.org/r/1198206 (https://phabricator.wikimedia.org/T408063) [07:09:18] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:09:46] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:12:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:12:13] sigh, it was me being dumb. Nevermind, carry on. 
[07:12:46] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30051 bytes in 8.985 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:12:55] (03Merged) 10jenkins-bot: cirrus: enable completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:13:27] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1197642|cirrus: enable completion search with defaultsort A/B test (T404858)]] [07:13:32] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:18:02] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1197642|cirrus: enable completion search with defaultsort A/B test (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:20:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:20:26] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [07:21:03] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories 
journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:32:05] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [07:32:54] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:33:51] (03PS1) 10KartikMistry: cxserver: Remove Yandex MT service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198213 (https://phabricator.wikimedia.org/T407345) [07:35:50] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-presto1013 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:35:52] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-presto1013 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T408065 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:36:04] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065 (10ops-monitoring-bot) 03NEW [07:43:53] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, sync categories journal) xfer categories from wdqs2008.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:44:01] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [07:44:15] !log dcausse@deploy2002 Sync 
cancelled. [07:44:53] (03CR) 10Elukey: [C:03+1] sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:45:01] (03PS1) 10DCausse: Revert "cirrus: enable completion search with defaultsort A/B test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198215 [07:45:41] (03CR) 10Elukey: [C:03+1] data.yaml offboarding amire80 [puppet] - 10https://gerrit.wikimedia.org/r/1198201 (owner: 10Slyngshede) [07:45:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198215 (owner: 10DCausse) [07:46:39] (03Merged) 10jenkins-bot: Revert "cirrus: enable completion search with defaultsort A/B test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198215 (owner: 10DCausse) [07:47:11] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1198215|Revert "cirrus: enable completion search with defaultsort A/B test"]] [07:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:51:31] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1198215|Revert "cirrus: enable completion search with defaultsort A/B test"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:52:41] !log dcausse@deploy2002 dcausse: Continuing with sync [07:54:07] (03PS1) 10Raunak1709: Fix typo in description field of patternProperties in device-generic.schema [homer/public] - 10https://gerrit.wikimedia.org/r/1198216 (https://phabricator.wikimedia.org/T201491) [07:56:50] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198215|Revert "cirrus: enable completion search with defaultsort A/B test"]] (duration: 09m 38s) [07:58:29] !log closing UTC morning backport window [07:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T0800). [08:00:46] andre: hi, can I fit one more backport in? [08:00:51] if not, I can wait for after the train [08:01:00] (03PS1) 10Dragoniez: jawiki: Add ipblock-exempt to the accountcreator user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198217 (https://phabricator.wikimedia.org/T407855) [08:01:02] kostajh, yes, as there is no train deployment right now [08:01:08] go ahead :) [08:01:11] ok. thanks [08:03:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198217 (https://phabricator.wikimedia.org/T407855) (owner: 10Dragoniez) [08:04:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:12] ok, I'm starting [08:06:18] (03CR) 10Filippo Giunchedi: "Please excuse the drive-by review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [08:06:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198131 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [08:10:32] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:14:13] (03PS1) 10Ozge: feat: deploys addalink to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198273 [08:14:42] (03CR) 10Alexandros Kosiaris: [C:03+2] "Gonna deploy this using charlie." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:14:50] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:14:54] !log [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'wdqs1014*' 'systemctl restart wdqs-blazegraph'` (restart service to fix 12 hour deadlock) [08:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:48] andre: no train deployment as in - it's starting in a few minutes, or it's blocked? [08:16:16] Krinkle: next train deployment is in about 10h, it's the UTC evening ride this week [08:16:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11301531 (10Jelto) [08:16:45] andre: I see. 
It still confuses me that we keep both on the deployment calendar :/ [08:16:59] and https://versions.toolforge.org/ and https://phabricator.wikimedia.org/T405680 look fine to me [08:17:18] well, we may use this very time if things did not work out UTC yesterday evening [08:17:23] but yeah I can see what you mean [08:17:41] (03PS13) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [08:18:33] (03Merged) 10jenkins-bot: Instrument the Suggested investigations feature [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198131 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [08:19:06] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1198131|Instrument the Suggested investigations feature (T404177)]] [08:19:18] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [08:20:17] RESOLVED: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:23:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:23:11] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1198131|Instrument the Suggested investigations feature (T404177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [08:24:21] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [08:24:58] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [08:25:22] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [08:25:37] (03PS7) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [08:25:55] (03PS3) 10Effie Mouzeli: site.pp: bye bye mwdebugXXXX 5 [puppet] - 10https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498) [08:25:57] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [08:26:13] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [08:26:20] (03CR) 10Effie Mouzeli: "Sorted, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [08:27:05] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[08:27:35] !log kharlan@deploy2002 kharlan: Continuing with sync [08:28:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:29:14] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: bye bye mwdebugXXXX 5 [puppet] - 10https://gerrit.wikimedia.org/r/1198035 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [08:31:41] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198131|Instrument the Suggested investigations feature (T404177)]] (duration: 12m 35s) [08:31:47] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [08:31:53] (03PS1) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [08:32:26] (03CR) 10Fabfur: [C:03+1] "lgtm! 
Sorry for the late review" [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [08:33:26] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:35:32] andre: ok, I'm done [08:37:02] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:39:02] (03PS1) 10A smart kitten: cswiktionary: Disable subpages in the main namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198283 (https://phabricator.wikimedia.org/T406728) [08:39:17] (03CR) 10Alexandros Kosiaris: [C:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:41:05] can someone e.g. kostajh please review and merge https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1198199 for me? 
users are complaining [08:41:55] that is a fix for a production error [08:42:40] (03Merged) 10jenkins-bot: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:44:03] (03CR) 10Daniel Kinzler: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [08:44:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198283 (https://phabricator.wikimedia.org/T406728) (owner: 10A smart kitten) [08:46:46] I'm escalating it to UBN and train blocker task [08:46:59] TimStarling: +2'ed [08:47:08] thanks [08:47:33] (03PS1) 10Tim Starling: recentchanges: Fix incorrect alias in isDenseTagFilter [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198285 (https://phabricator.wikimedia.org/T408040) [08:49:04] (03PS17) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [08:49:04] (03PS18) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [08:49:05] (03PS17) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [08:49:05] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [08:49:05] (03PS2) 10Btullis: Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) 
[08:49:13] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [08:49:20] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [08:49:28] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [08:49:37] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:49:45] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:49:51] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [08:49:53] is it OK to go ahead with scap backport for that? I know we're in the train window but as I say I am not recommending going ahead with the train without the bugfix [08:49:59] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [08:50:05] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [08:50:13] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [08:51:20] I will take that minute of silence as a yes [08:51:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198285 (https://phabricator.wikimedia.org/T408040) (owner: 10Tim Starling) [08:51:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7397/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [08:52:53] (03CR) 10Btullis: Pin the version of opensearch-dashboards wherever it is used (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [08:54:06] 06SRE, 06collaboration-services, 
06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11301787 (10fgiunchedi) My two cents re: function/naming, yes +1 to keep it as generic as we reasonably can for all non-http proxying uses we m... [08:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:54:25] (03CR) 10Daniel Kinzler: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [08:56:03] (03PS3) 10Daniel Kinzler: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 [08:56:53] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [08:57:02] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [08:57:16] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [08:57:25] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [08:57:40] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:57:50] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:57:58] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [08:58:07] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [08:58:33] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [08:58:43] !log akosiaris@deploy2002 helmfile [eqiad] 
DONE helmfile.d/services/shellbox-video: apply [08:58:56] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [08:59:00] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [08:59:39] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [08:59:43] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [09:00:22] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [09:00:30] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [09:01:24] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [09:01:29] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [09:01:41] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [09:01:45] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [09:07:19] (03Merged) 10jenkins-bot: recentchanges: Fix incorrect alias in isDenseTagFilter [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198285 (https://phabricator.wikimedia.org/T408040) (owner: 10Tim Starling) [09:07:52] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1198285|recentchanges: Fix incorrect alias in isDenseTagFilter (T408040)]] [09:07:57] T408040: Wikimedia\Rdbms\DBQueryError: Error 1052: Column 'ct_tag_id' in WHERE is ambiguous - https://phabricator.wikimedia.org/T408040 [09:09:43] (03CR) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:09:50] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [09:10:27] !log 
akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [09:10:46] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [09:11:41] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [09:11:53] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:11:56] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1198285|recentchanges: Fix incorrect alias in isDenseTagFilter (T408040)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:12:37] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:14:38] !log tstarling@deploy2002 tstarling: Continuing with sync [09:15:42] TimStarling: looks like the patch fails sqlite and postgres tests [09:17:49] annoying [09:18:12] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [09:18:24] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [09:18:39] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [09:18:43] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198285|recentchanges: Fix incorrect alias in isDenseTagFilter (T408040)]] (duration: 10m 51s) [09:18:48] it's not an actual failure, it's just because there's a $dbType === 'mysql' condition which is not reflected in the test [09:18:53] T408040: Wikimedia\Rdbms\DBQueryError: Error 1052: Column 'ct_tag_id' in WHERE is ambiguous - https://phabricator.wikimedia.org/T408040 [09:18:58] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [09:19:14] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [09:19:28] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply 
[09:19:39] I mean it is a test problem not a reality problem [09:21:47] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [09:22:06] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [09:22:57] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [09:23:12] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [09:23:37] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [09:24:09] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [09:25:12] PS3 is up now, it just skips the test on those DBs [09:25:40] I'm assuming we don't need a separate commit in the deployment branch since it doesn't break CI there, PG/SQLite tests only run against master [09:26:26] anyway, sorry about that, hopefully PS3 will merge now [09:26:28] (03PS18) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [09:26:28] (03PS3) 10Btullis: Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) [09:26:28] (03PS1) 10Btullis: Add a new apt component for our custom elasticsearch-curator build [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) [09:27:05] (03CR) 10CI reject: [V:04-1] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:29:23] (03CR) 10Btullis: Change the component from where we install elasticsearch-curator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:32:24] (03PS2) 10Btullis: Update the apt components used 
for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) [09:32:24] (03PS4) 10Btullis: Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) [09:33:09] !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [09:33:54] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7398/console" [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [09:36:06] !skip updating datahub for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1197304, too much of a bump [09:37:00] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2034.codfw.wmnet [09:37:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2034.codfw.wmnet [09:38:15] (03CR) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [09:39:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [09:39:38] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:39:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2034 gradually with 4 steps - Pooling in [09:40:35] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:41:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [09:41:36] !log akosiaris@deploy2002 helmfile [eqiad] START 
helmfile.d/services/miscweb: apply [09:43:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [09:43:55] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:44:16] FIRING: SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:45:06] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [09:46:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [09:46:46] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:50:10] (03PS1) 10Federico Ceratto: instances.yaml, es2057.yaml: Prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1198290 (https://phabricator.wikimedia.org/T402859) [09:53:06] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [09:54:49] (03PS2) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [09:54:51] (03Merged) 10jenkins-bot: rest-gateway: Deploy rate limiting in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194174 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [09:54:54] (03PS1) 10Alexandros Kosiaris: misweb: Add main_app.volumes back in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198292 [09:54:54] (03PS1) 10Alexandros Kosiaris: Remove wmf.volumes from aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198293 [09:56:52] (03CR) 10Slyngshede: [C:03+2] data.yaml offboarding amire80 [puppet] - 
10https://gerrit.wikimedia.org/r/1198201 (owner: 10Slyngshede) [09:58:49] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1089.eqiad.wmnet with OS bullseye [09:58:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11301979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1089.eqiad.wmnet with OS bullseye [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1000) [10:01:43] (03CR) 10Alexandros Kosiaris: [C:03+2] "CI checks out, merging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198292 (owner: 10Alexandros Kosiaris) [10:02:52] (03PS1) 10DCausse: Revert^2 "cirrus: enable completion search with defaultsort A/B test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198295 [10:03:08] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Amire80 out of all services on: 2412 hosts [10:03:48] (03Merged) 10jenkins-bot: misweb: Add main_app.volumes back in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198292 (owner: 10Alexandros Kosiaris) [10:09:23] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [10:10:20] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:10:29] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [10:11:13] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1089.eqiad.wmnet with reason: host reimage [10:12:31] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [10:13:10] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [10:14:29] (03PS2) 10Alexandros Kosiaris: Remove wmf.volumes from aqs-http-gateway [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1198293 [10:14:55] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [10:15:08] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1089.eqiad.wmnet with reason: host reimage [10:16:44] !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [10:17:06] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:17:17] !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [10:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:19:54] !log akosiaris@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [10:20:17] !log akosiaris@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [10:23:11] (03PS3) 10Majavah: P:toolforge: Remove separate proxy role [puppet] - 10https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948) [10:23:11] (03PS3) 10Majavah: P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - 10https://gerrit.wikimedia.org/r/1198051 [10:23:11] (03PS1) 10Majavah: clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - 10https://gerrit.wikimedia.org/r/1198299 [10:24:16] (03CR) 10CI reject: [V:04-1] clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - 10https://gerrit.wikimedia.org/r/1198299 (owner: 10Majavah) [10:25:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2034 gradually with 4 steps - Pooling in [10:25:23] (03PS2) 10Majavah: clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - 10https://gerrit.wikimedia.org/r/1198299 [10:26:00] (03CR) 10Hnowlan: 
[C:03+1] Update /page/ lint routes to use the new rest.php endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197731 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [10:28:17] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:29:17] FIRING: [16x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:34] (03PS3) 10Alexandros Kosiaris: Remove wmf.volumes from aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198293 [10:31:29] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:31:43] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1089.eqiad.wmnet with OS bullseye [10:31:49] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:32:07] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11302106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1089.eqiad.wmnet with OS bullseye complete... 
[10:32:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:34:13] FIRING: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:34:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:36:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:36:51] (03CR) 10Alexandros Kosiaris: [C:03+2] "CI looks good now, this is the last one in the cleanup, merging." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198293 (owner: 10Alexandros Kosiaris) [10:39:55] (03Merged) 10jenkins-bot: Remove wmf.volumes from aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198293 (owner: 10Alexandros Kosiaris) [10:41:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:41:28] (03PS1) 10Btullis: Add a thirparty/documentdb component to reprepro for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1198300 (https://phabricator.wikimedia.org/T408085) [10:43:27] (03PS6) 10Gmodena: blazegraph: add cluster sync check [alerts] - 10https://gerrit.wikimedia.org/r/1174723 (https://phabricator.wikimedia.org/T408026) [10:44:13] RESOLVED: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:45:37] (03PS5) 10Elukey: WIP: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 [10:45:45] (03PS1) 10Clément Goubert: rest-gateway: Fix ratelimit service redis port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198301 [10:46:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:46:09] (03PS5) 10Neslihan Turan: Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [10:47:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:47:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:48:31] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:49:41] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix ratelimit service redis port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198301 (owner: 10Clément Goubert) [10:50:01] (03CR) 10Nikerabbit: [C:03+1] cxserver: Remove Yandex MT service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198213 (https://phabricator.wikimedia.org/T407345) (owner: 10KartikMistry) [10:50:13] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS trixie [10:51:30] (03Merged) 10jenkins-bot: rest-gateway: Fix ratelimit service redis port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198301 (owner: 10Clément Goubert) [10:53:03] RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [10:56:13] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:56:28] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:57:36] (03PS2) 10Btullis: Add a thirparty/documentdb component to reprepro for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1198300 (https://phabricator.wikimedia.org/T408085) [10:58:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7399/console" [puppet] - 10https://gerrit.wikimedia.org/r/1198300 (https://phabricator.wikimedia.org/T408085) (owner: 10Btullis) [10:59:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:01:13] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:01:37] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:01:43] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:01:53] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:01:57] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:02:01] (03CR) 10Btullis: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [11:03:36] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:03:40] 06SRE, 10SRE-Access-Requests: replace ssh keys with yubikey-backed key for Daniel Z - https://phabricator.wikimedia.org/T407917#11302223 (10Raine) 05Open→03Resolved [11:03:50] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:06:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:06:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:08:28] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [11:10:07] (03PS1) 10Daniel Kinzler: ~daniel/.screenrc: force login shell [puppet] - 10https://gerrit.wikimedia.org/r/1198305 (https://phabricator.wikimedia.org/T404739) [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:13:26] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage 
[11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:22:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:24:16] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:52] (03CR) 10Brouberol: [C:03+1] Add a thirparty/documentdb component to reprepro for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1198300 (https://phabricator.wikimedia.org/T408085) (owner: 10Btullis) [11:24:57] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:25:09] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:26:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:27:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:28:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:31:06] (03PS6) 10Elukey: sre.hosts.provision: remove boot order config in UEFI for Dells [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 (https://phabricator.wikimedia.org/T406964) [11:33:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:37:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:45:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11302362 (10elukey) Thanks a lot for the firmware upgrades! I'll check what's wrong with the cookbook, afaics it seems something re... 
[11:46:20] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11302386 (10elukey) >>! In T404356#11263414, @elukey wrote: > > The next step is to test multiple reimages on ms-be2078 and see if w... [11:47:52] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2078.codfw.wmnet with OS trixie [11:48:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:49:11] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:51:16] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:51:26] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:54:00] (03CR) 10Btullis: [V:03+1 C:03+2] Add a thirparty/documentdb component to reprepro for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1198300 (https://phabricator.wikimedia.org/T408085) (owner: 10Btullis) [11:54:13] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:56:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:56:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:58:33] (03CR) 10Jcrespo: [C:03+1] instances.yaml, es2057.yaml: Prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1198290 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:59:13] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1200) [12:04:04] (03CR) 10Brouberol: [C:03+1] Update the apt components used for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:05:41] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml, es2057.yaml: Prepare for production [puppet] - 10https://gerrit.wikimedia.org/r/1198290 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:07:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:11:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:12:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:15:12] Deploying cxserver. [12:15:42] (03PS2) 10KartikMistry: cxserver: Remove Yandex MT service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198213 (https://phabricator.wikimedia.org/T407345) [12:16:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:16:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:18:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11302562 (10Jclark-ctr) Sorry for not updating yesterday @elukey 2 servers yesterday we where talking about Getting the same errors... 
[12:18:43] (03CR) 10KartikMistry: [C:03+2] cxserver: Remove Yandex MT service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198213 (https://phabricator.wikimedia.org/T407345) (owner: 10KartikMistry) [12:20:31] (03Merged) 10jenkins-bot: cxserver: Remove Yandex MT service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198213 (https://phabricator.wikimedia.org/T407345) (owner: 10KartikMistry) [12:20:50] (03PS1) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [12:22:02] (03PS2) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [12:22:35] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:22:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet-wmde - https://phabricator.wikimedia.org/T407605#11302570 (10Raine) [12:22:57] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:22:59] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet-wmde - https://phabricator.wikimedia.org/T407605#11302573 (10Raine) @Virginie.caplet I assume your developer account username is vicaplet-wmde, correcting it. [12:25:36] (03CR) 10Clément Goubert: "I don't like the double negative, I'm going to flip the boolean." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [12:26:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:28:20] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11302589 (10Jclark-ctr) a:03Jclark-ctr This server is currently out of warranty we do have spare 4TB drives on hand we can install please advise when we can replace. [12:30:25] (03PS1) 10Jcrespo: transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) [12:30:46] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:31:17] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:31:51] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:32:25] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:32:32] (03PS3) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [12:32:55] !log cxserver: Remove Yandex MT service (T407345) [12:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:59] T407345: cxserver: Yandex MT service failure - https://phabricator.wikimedia.org/T407345 [12:33:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet-wmde - https://phabricator.wikimedia.org/T407605#11302636 (10Raine) 
Noting that for analytics-privatedata-users, "Explicit approval is not required for WMF or WMDE Staff." (T381824, T370424). [12:33:53] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet-wmde - https://phabricator.wikimedia.org/T407605#11302640 (10Raine) [12:35:23] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11302643 (10Jclark-ctr) updated idrac firmware while logged in to 7.00.00.182 from 5.10.10.00 [12:36:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11302644 (10Jclark-ctr) [12:38:38] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11302651 (10Jclark-ctr) @VRiley-WMF I have updated the Unit location for es1057 and updated the model description for dbprov1007 can you remove the 2x servers you missed while performing dec... [12:41:56] (03PS1) 10KartikMistry: Update Recommendation API to 2025-10-22-134201-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198319 (https://phabricator.wikimedia.org/T407895) [12:43:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11302680 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr removed cable and updated netbox [12:45:08] and updating recommendation-api as well.. 
[12:45:48] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:46:29] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-10-22-134201-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198319 (https://phabricator.wikimedia.org/T407895) (owner: 10KartikMistry) [12:46:36] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:46:54] (03PS1) 10Kamila Součková: admin: add vicaplet-wmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1198320 (https://phabricator.wikimedia.org/T407605) [12:48:05] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-10-22-134201-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198319 (https://phabricator.wikimedia.org/T407895) (owner: 10KartikMistry) [12:48:57] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:50:06] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:52:29] (03CR) 10Ssingh: [C:03+1] wikimedia.org: Add Figma domain verification [dns] - 10https://gerrit.wikimedia.org/r/1198191 (https://phabricator.wikimedia.org/T408003) (owner: 10BCornwall) [12:52:52] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408055#11302702 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Found L1 and L2 at 12 or near reblanced pdu moving more to L3 leg. all legs are under 10 now [12:53:44] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:55:09] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11302712 (10Raine) @SKaram-WMF just to confirm, do you need SSH access or only dashboards and such? Thanks! [12:55:39] (03CR) 10Giuseppe Lavagetto: "I don't think this needs any review given the very generic nature of the score." [puppet] - 10https://gerrit.wikimedia.org/r/1198130 (owner: 10CDanis) [12:56:26] (03PS18) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [12:56:26] (03PS19) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [12:56:26] (03PS19) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [12:56:27] (03PS3) 10Btullis: Update the apt components used for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) [12:56:28] (03PS5) 10Btullis: Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) [12:57:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11302715 (10Raine) >>! In T407094#11302712, @Raine wrote: > @SKaram-WMF just to confirm, do you need SSH access or only dashboards and such? Thanks! My bad, I see you reque... 
[12:57:31] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:57:35] (03CR) 10CI reject: [V:04-1] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:57:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:57:47] (03CR) 10CDanis: haproxy: x-is-browser: --> Data Lake (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198130 (owner: 10CDanis) [12:58:02] (03CR) 10CDanis: [C:03+2] haproxy: x-is-browser: --> Data Lake [puppet] - 10https://gerrit.wikimedia.org/r/1198130 (owner: 10CDanis) [12:58:35] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:58:47] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie [12:58:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11302717 (10Raine) [12:59:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7400/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:59:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11302726 (10elukey) Provisioned both nodes with `--no-user --no-dhcp --no-switch` and they worked. 
Trying to reimage sretest1005 now :) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1300). [13:00:05] MatmaRex, Dragoniez, and A_smart_kitten: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] !log Update Recommendation API to 2025-10-22-134201-production (T407895, T407894) [13:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:17] T407895: Recommendation api should not show page collections with no articles - https://phabricator.wikimedia.org/T407895 [13:00:17] T407894: Recommendation api should remove untagged page collections - https://phabricator.wikimedia.org/T407894 [13:00:20] o/ [13:00:20] here o/ [13:00:26] o/ [13:00:35] hey [13:01:08] I can deploy! 
[13:02:16] let’s start with the OAuth config change [13:02:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [13:03:27] (03PS4) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [13:03:38] (03Merged) 10jenkins-bot: Add virtual domain mapping for OAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [13:04:10] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1196441|Add virtual domain mapping for OAuth (T348485)]] [13:04:11] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit for deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198119 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:04:15] T348485: Migrate OAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348485 [13:04:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit for deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198120 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:04:45] (03PS1) 10Brouberol: kubernetes: register the postgresql 17-documentdb image into the common images [puppet] - 10https://gerrit.wikimedia.org/r/1198324 (https://phabricator.wikimedia.org/T406578) [13:06:16] (03CR) 10Btullis: kubernetes: register the postgresql 17-documentdb image into the common images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198324 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:06:59] (03PS1) 10Kamila Součková: admin: add skaramwmf to 
analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) [13:07:00] (03PS2) 10Brouberol: kubernetes: register the postgresql 17-documentdb image into the common images [puppet] - 10https://gerrit.wikimedia.org/r/1198324 (https://phabricator.wikimedia.org/T406578) [13:07:01] (03PS4) 10Klausman: team-ml: Change helmfile_admin_ng_pending_changes alert to fire after 1w [alerts] - 10https://gerrit.wikimedia.org/r/1198321 (https://phabricator.wikimedia.org/T403047) [13:07:34] (03CR) 10Ssingh: "Looking good! Let's remove the override for dns1004 and run PCC for dns1004, doh1001 and happy to +1 then." [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [13:07:47] (03CR) 10Brouberol: kubernetes: register the postgresql 17-documentdb image into the common images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198324 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:08:36] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Backport for [[gerrit:1196441|Add virtual domain mapping for OAuth (T348485)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:12] MatmaRex: please test! [13:10:06] looking [13:10:07] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [13:10:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:43] Lucas_WMDE: looks good [13:10:50] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Continuing with sync [13:10:52] thanks! 
[13:11:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:11:06] MatmaRex: remind me, how long is that maintenance script expected to run? [13:11:20] bc I’m on holiday from tomorrow, so if it’s more than a few hours, ideally someone else should run it ^^ [13:11:37] Lucas_WMDE: hopefully no more than a few hours [13:11:43] (03PS5) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [13:11:47] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Check for same log_actor between local and global log entry [extensions/CentralAuth] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198119 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:11:49] ok, then we can give it a try [13:12:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:13:56] and we can deploy config changes together with those backports, as the backports should be totally safe since they only touch a maint script [13:14:07] pondering whether to deploy the changes for Dragoniez and A_smart_kitten together or separately [13:14:10] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Check for same log_actor between local and global log entry [extensions/CentralAuth] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198120 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:14:24] honestly I 
don’t see a need to separate them. might as well combine them [13:14:40] (03CR) 10Brouberol: [C:03+2] kubernetes: register the postgresql 17-documentdb image into the common images [puppet] - 10https://gerrit.wikimedia.org/r/1198324 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:14:59] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196441|Add virtual domain mapping for OAuth (T348485)]] (duration: 10m 48s) [13:15:04] T348485: Migrate OAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348485 [13:15:25] Lucas_WMDE: personally I'm fine with either; am I right in thinking that the risk with deploying them together is in case one of them somehow breaks the wikis and needs to be reverted (and the other therefore needs to be redeployed?) [13:15:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:34] pretty much yeah [13:15:50] it can also make it slightly harder to figure out which change is responsible for an isseu [13:15:52] I'm good with combinatory deploy and fine with either [13:16:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198217 (https://phabricator.wikimedia.org/T407855) (owner: 10Dragoniez) [13:16:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198283 (https://phabricator.wikimedia.org/T406728) (owner: 10A smart kitten) [13:16:18] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:16:39] !log 
sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet [13:16:46] (*issue) [13:17:08] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [13:17:16] (03Merged) 10jenkins-bot: jawiki: Add ipblock-exempt to the accountcreator user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198217 (https://phabricator.wikimedia.org/T407855) (owner: 10Dragoniez) [13:17:19] (03Merged) 10jenkins-bot: cswiktionary: Disable subpages in the main namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198283 (https://phabricator.wikimedia.org/T406728) (owner: 10A smart kitten) [13:17:51] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1198119|FixRenameUserLocalLogs: Check for same log_actor between local and global log entry (T398177)]], [[gerrit:1198120|FixRenameUserLocalLogs: Check for same log_actor between local and global log entry (T398177)]], [[gerrit:1198217|jawiki: Add ipblock-exempt to the accountcreator user group (T407855)]], [[gerrit:1198283|cswiktionary: [13:17:51] Disable subpages in the main namespace (T406728)]] [13:17:58] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:17:58] T407855: jawiki: Add ipblock-exempt to the accountcreator user group - https://phabricator.wikimedia.org/T407855 [13:17:59] T406728: [cswiktionary] Disable subpages in the main namespace - https://phabricator.wikimedia.org/T406728 [13:18:03] and that’s the other hazard of deploying too much at once :) [13:18:26] !log (cont) Started scap sync-world: Backport for … [[gerrit:1198283|cswiktionary:Disable subpages in the main namespace (T406728)]] [13:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:16] tbh i've thought about filing a task before to see if there's any 
way that scap & stashbot can somehow deal with long backport SAL logs a bit more gracefully [13:19:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:18] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:20:23] it feels like it should be possible in some way, but it doesn't seem like it'd be the simplest thing to implement. [but i'm getting off-topic] [13:21:52] (03CR) 10Agamyasamuel: "recheck" [homer/public] - 10https://gerrit.wikimedia.org/r/1198216 (https://phabricator.wikimedia.org/T201491) (owner: 10Raunak1709) [13:22:16] !log lucaswerkmeister-wmde@deploy2002 asmartkitten, lucaswerkmeister-wmde, matmarex, dragoniez: Backport for [[gerrit:1198119|FixRenameUserLocalLogs: Check for same log_actor between local and global log entry (T398177)]], [[gerrit:1198120|FixRenameUserLocalLogs: Check for same log_actor between local and global log entry (T398177)]], [[gerrit:1198217|jawiki: Add ipblock-exempt to the accountcreator user group (T407855)]] [13:22:16] , [[gerrit:1198283|cswiktionary: Disable subpages in the main namespace (T406728)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:22:51] Looks all good for my patch [13:22:56] looking [13:23:17] looks good for mine :) [13:23:21] !log lucaswerkmeister-wmde@deploy2002 asmartkitten, lucaswerkmeister-wmde, matmarex, dragoniez: Continuing with sync [13:23:43] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [13:26:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:26:55] there’s a new spike in logspam-watch [13:27:02] but I think I remember seeing the same errors already an hour ago [13:27:04] (also as a spike) [13:27:09] maybe something just runs hourly as a timer [13:27:15] * Lucas_WMDE searches [13:27:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198119|FixRenameUserLocalLogs: Check for same log_actor between local and global log entry (T398177)]], [[gerrit:1198120|FixRenameUserLocalLogs: Check for same log_actor between local and global log entry (T398177)]], [[gerrit:1198217|jawiki: Add ipblock-exempt to the accountcreator user group (T407855)]], [[gerrit:1198283|cswiktionary: [13:27:32] Disable subpages in the main namespace (T406728)]] (duration: 09m 41s) [13:27:39] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:27:40] T407855: jawiki: Add ipblock-exempt to the accountcreator user group - https://phabricator.wikimedia.org/T407855 [13:27:40] T406728: [cswiktionary] Disable subpages in the main namespace - https://phabricator.wikimedia.org/T406728 [13:27:53] !log (cont.) 
Finished scap sync-world: Backport for … [[gerrit:1198283|cswiktionary:Disable subpages in the main namespace (T406728)]] (duration: 09m 41s) [13:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:06] yeah one of the errors is T408052 [13:28:06] T408052: PHP Warning: Trying to access array offset on value of type null (via GrowthExperiments listTaskCounts) - https://phabricator.wikimedia.org/T408052 [13:28:39] thanks for deploying Lucas_WMDE! [13:28:45] np :) [13:28:48] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie [13:28:59] !log UTC afternoon backport+config window done [13:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:08] Lucas_WMDE: Thanks for your help :) [13:29:11] (the maintenance script will run longer, no point in keeping the window open for that I think) [13:29:17] RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:17] !log bking@cumin2002 `sudo cumin 'A:wdqs-main and A:codfw' 'depool ; systemctl restart wdqs-blazegraph ; sleep 30 ; pool'` [13:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:29] MatmaRex: does this command look okay to you? 
[13:30:30] mwscript-k8s --comment='T398177 (dry run)' --follow --sal --dblist=sul -- CentralAuth:FixRenameUserLocalLogs --logwiki=metawiki --batch-size=25 | tee ~/T398177-run5.dry [13:30:40] that’s the same as last time (I think) except for the added --batch-size=25 [13:30:43] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie [13:30:56] Lucas_WMDE: yep [13:31:02] RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:31:21] (03CR) 10JHathaway: "I think that would probably work for `net.netfilter.nf_conntrack_max`, but in the general case modules may be loaded or unloaded and reloa" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [13:31:23] just checking the IRC logs to see if I complained about having forgotten anything last time ^^ [13:31:59] doesn’t look like it [13:32:11] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11302844 (10MatthewVernon) megacli might have been copied to trixie, but it's useless there, because as you say it depends upon libncurses5, w... 
[13:32:14] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist sul CentralAuth:FixRenameUserLocalLogs --logwiki=metawiki --batch-size=25 # T398177 (dry run) [13:32:17] (03PS6) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [13:32:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11302846 (10Papaul) @elukey no problem [13:33:14] (03CR) 10Herron: [C:03+1] thanos::rule: Cleanup firewall handling [puppet] - 10https://gerrit.wikimedia.org/r/1197590 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [13:33:47] (03PS1) 10Brouberol: kubernetes: postgresql keys must be integers-like [puppet] - 10https://gerrit.wikimedia.org/r/1198327 (https://phabricator.wikimedia.org/T406578) [13:34:01] (03CR) 10Majavah: [V:03+1 C:03+2] thanos::rule: Cleanup firewall handling [puppet] - 10https://gerrit.wikimedia.org/r/1197590 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [13:34:16] thanks Lucas_WMDE [13:34:16] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::metricsinfra: Fix thanos::rule usage [puppet] - 10https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [13:34:25] (03CR) 10Brouberol: [C:03+2] kubernetes: postgresql keys must be integers-like [puppet] - 10https://gerrit.wikimedia.org/r/1198327 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:35:37] (03CR) 10Btullis: [C:03+1] kubernetes: register the postgresql 17-documentdb image into the common images [puppet] - 10https://gerrit.wikimedia.org/r/1198324 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:37:19] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Key packages missing from trixie-wikimedia - 
https://phabricator.wikimedia.org/T407513#11302887 (10MatthewVernon) [trixie does have the unofficial https://packages.debian.org/stable/admin/megactl but I don't know if a) that works... [13:38:08] (03PS1) 10Jelto: gerrit: block one more UA [puppet] - 10https://gerrit.wikimedia.org/r/1198328 (https://phabricator.wikimedia.org/T365259) [13:39:09] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [13:39:17] (03PS7) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [13:39:27] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [13:39:41] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS trixie [13:39:52] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [13:40:11] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [13:40:26] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply [13:40:27] (03CR) 10Jelto: [C:03+2] gerrit: block one more UA [puppet] - 10https://gerrit.wikimedia.org/r/1198328 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [13:40:42] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply [13:40:58] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [13:41:17] (03CR) 10CI reject: [V:04-1] api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [13:41:18] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [13:41:32] !log akosiaris@deploy2002 
helmfile [eqiad] START helmfile.d/services/device-analytics: apply [13:41:51] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [13:42:03] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [13:42:19] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [13:42:31] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [13:42:46] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [13:42:59] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [13:43:17] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [13:43:27] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [13:43:40] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [13:43:54] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [13:44:13] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [13:44:33] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [13:44:52] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [13:44:54] (03PS8) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [13:45:09] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [13:45:22] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [13:45:39] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [13:46:08] !log akosiaris@deploy2002 helmfile [staging] DONE 
helmfile.d/services/geo-analytics: apply [13:46:25] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [13:46:43] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [13:47:08] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [13:47:21] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [13:47:42] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [13:48:00] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [13:48:29] SRE, SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11302959 (elukey) >>! In T407513#11302844, @MatthewVernon wrote: > megacli might have been copied to trixie, but it's useless there, because... [13:49:18] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle) [13:49:54] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [13:50:14] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [13:50:34] (PS1) Btullis: Add dse-k8s-worker2003 to the kubesvc pool [puppet] - https://gerrit.wikimedia.org/r/1198329 (https://phabricator.wikimedia.org/T406985) [13:51:12] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [13:51:23] (PS9) Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [13:51:24] !log akosiaris@deploy2002 
helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [13:51:42] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [13:52:00] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [13:52:19] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [13:52:39] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [13:52:41] (CR) CDanis: [C:+1] "Thanks Jelto! This is a good start, but we will also need to make a bunch of modifications to modules/profile/{templates,manifests}/servic" [puppet] - https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: Jelto) [13:52:52] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [13:53:05] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [13:53:25] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [13:53:43] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [13:54:09] (CR) Ssingh: [C:+1] Add dse-k8s-worker2003 to the kubesvc pool [puppet] - https://gerrit.wikimedia.org/r/1198329 (https://phabricator.wikimedia.org/T406985) (owner: Btullis) [13:54:29] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [13:54:51] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [13:55:57] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [13:56:14] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [13:56:54] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [13:57:12] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply 
[13:57:25] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:57:25] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [13:57:51] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [13:58:00] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [13:58:17] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [13:59:26] (CR) Btullis: [C:+2] Add dse-k8s-worker2003 to the kubesvc pool [puppet] - https://gerrit.wikimedia.org/r/1198329 (https://phabricator.wikimedia.org/T406985) (owner: Btullis) [13:59:42] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:00:27] (CR) JHathaway: [C:+1] "looks good to me" [cookbooks] - https://gerrit.wikimedia.org/r/1194892 (https://phabricator.wikimedia.org/T406964) (owner: Elukey) [14:02:07] (CR) Elukey: [C:+2] sre.hosts.provision: remove boot order config in UEFI for Dells [cookbooks] - https://gerrit.wikimedia.org/r/1194892 (https://phabricator.wikimedia.org/T406964) (owner: Elukey) [14:03:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [14:03:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [14:06:54] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:13:54] (PS1) Brouberol: kubernetes: update postgresql 17 [puppet] - https://gerrit.wikimedia.org/r/1198334 (https://phabricator.wikimedia.org/T406578) [14:14:21] !log 
elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:14:45] (CR) Brouberol: [C:+2] kubernetes: update postgresql 17 [puppet] - https://gerrit.wikimedia.org/r/1198334 (https://phabricator.wikimedia.org/T406578) (owner: Brouberol) [14:14:54] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS trixie [14:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:26:10] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1430) [14:31:41] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [14:37:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:38:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [14:40:35] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [14:40:40] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [14:40:42] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Eqiad C/D refresh: 2 x test hosts for config validation - 
https://phabricator.wikimedia.org/T405560#11303180 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [14:41:04] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11303181 (elukey) The hosts needed to be uefi-provisioned, and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1194892 needed... [14:41:40] !log stevemunene@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/echoserver: apply [14:42:33] !log stevemunene@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/echoserver: apply [14:43:22] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS trixie [14:48:27] SRE, SRE-Access-Requests: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11303204 (Raine) [14:48:43] (PS1) CDanis: haproxy: ja4h: --> global [puppet] - https://gerrit.wikimedia.org/r/1198339 (https://phabricator.wikimedia.org/T406990) [14:48:56] SRE, SRE-Access-Requests: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11303208 (Raine) a:DMburugu @DMburugu can you please approve this? Thanks! [14:49:09] (CR) FNegri: [C:+1] P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - https://gerrit.wikimedia.org/r/1198051 (owner: Majavah) [14:49:14] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11303211 (elukey) >>! In T406964#11300946, @Papaul wrote: > While trying to use the firmware upgrade cookbook with "sudo cookbook sre.hardware.upgrade-... 
[14:49:39] (CR) FNegri: [C:+1] P:toolforge: Remove separate proxy role [puppet] - https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948) (owner: Majavah) [14:49:56] (CR) Majavah: [C:+2] P:toolforge: Remove separate proxy role [puppet] - https://gerrit.wikimedia.org/r/1198050 (https://phabricator.wikimedia.org/T283948) (owner: Majavah) [14:49:58] (PS1) Kamila Součková: admin: add urbanecm to phabricator-admins [puppet] - https://gerrit.wikimedia.org/r/1198340 (https://phabricator.wikimedia.org/T408008) [14:50:09] (CR) Majavah: [C:+2] P:toolforge: Remove long-obsolete proxylistener systemd unit code [puppet] - https://gerrit.wikimedia.org/r/1198051 (owner: Majavah) [14:50:30] (CR) Fabfur: [C:+1] haproxy: ja4h: --> global [puppet] - https://gerrit.wikimedia.org/r/1198339 (https://phabricator.wikimedia.org/T406990) (owner: CDanis) [14:50:33] (CR) Cathal Mooney: [C:+1] "LGTM! Let's maybe wait for Moritz's thoughts too. 
Probably there is some systemd dependency way to deal with it, but also sysctl's need " [puppet] - https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: JHathaway) [14:50:34] (CR) Kamila Součková: [C:-2] "DNM: waiting for manager approval" [puppet] - https://gerrit.wikimedia.org/r/1198340 (https://phabricator.wikimedia.org/T408008) (owner: Kamila Součková) [14:50:41] (PS2) CDanis: haproxy: ja4h: --> global [puppet] - https://gerrit.wikimedia.org/r/1198339 (https://phabricator.wikimedia.org/T406990) [14:50:44] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198339 (https://phabricator.wikimedia.org/T406990) (owner: CDanis) [14:53:12] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "run sync to add new nokia switches - cmooney@cumin1003 - T405558" [14:53:17] T405558: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558 [14:53:28] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "run sync to add new nokia switches - cmooney@cumin1003 - T405558" [14:53:36] (PS1) Kamila Součková: admin: add dpogorzelski to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) [14:54:18] (CR) Kamila Součková: [C:-2] "DNM, waiting for approval" [puppet] - https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: Kamila Součková) [14:54:28] (CR) CDanis: [C:+2] haproxy: ja4h: --> global [puppet] - https://gerrit.wikimedia.org/r/1198339 (https://phabricator.wikimedia.org/T406990) (owner: CDanis) [14:57:03] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [14:57:17] (PS1) Andrew Bogott: 
preseed: (TEMPORARY) switch maps-test hosts to raid 10 [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) [14:59:16] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:17] (CR) CI reject: [V:-1] preseed: (TEMPORARY) switch maps-test hosts to raid 10 [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [14:59:34] (CR) Filippo Giunchedi: "> I think that would probably work for `net.netfilter.nf_conntrack_max`, but in the general case modules may be loaded or unloaded and rel" [puppet] - https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: JHathaway) [15:00:05] dancy and andre: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1500). 
[15:00:07] elukey@cumin1003 reimage (PID 3441211) is awaiting input [15:00:18] (CR) Filippo Giunchedi: [C:+1] preseed: (TEMPORARY) switch maps-test hosts to raid 10 [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:02:54] (CR) DamianZaremba: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:02:59] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [15:03:01] (CR) DamianZaremba: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:03:47] SRE, SRE-Access-Requests: replace ssh keys with yubikey-backed key for Daniel Z - https://phabricator.wikimedia.org/T407917#11303305 (Dzahn) Resolved→Open Thanks! Just still need to remove my old key. [15:04:22] (PS1) Majavah: P:toolforge: Remove obsolete spec test [puppet] - https://gerrit.wikimedia.org/r/1198349 (https://phabricator.wikimedia.org/T283948) [15:05:01] (CR) Andrew Bogott: [C:+1] P:toolforge: Remove obsolete spec test [puppet] - https://gerrit.wikimedia.org/r/1198349 (https://phabricator.wikimedia.org/T283948) (owner: Majavah) [15:05:09] (CR) Majavah: [C:+2] P:toolforge: Remove obsolete spec test [puppet] - https://gerrit.wikimedia.org/r/1198349 (https://phabricator.wikimedia.org/T283948) (owner: Majavah) [15:05:40] (PS2) Andrew Bogott: preseed: (TEMPORARY) switch maps-test hosts to raid 10 [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) [15:07:26] (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1195259 
(https://phabricator.wikimedia.org/T389333) (owner: CDobbins) [15:08:37] (CR) Krinkle: "Logic was removed in https://gerrit.wikimedia.org/r/1194558 for T405931." [puppet] - https://gerrit.wikimedia.org/r/959686 (https://phabricator.wikimedia.org/T344175) (owner: Fabfur) [15:08:58] (CR) Andrew Bogott: [C:+2] preseed: (TEMPORARY) switch maps-test hosts to raid 10 [puppet] - https://gerrit.wikimedia.org/r/1198344 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:28] (PS15) CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [15:10:18] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [15:10:56] (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7402/console" [puppet] - https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: CDobbins) [15:15:33] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host maps-test2002.codfw.wmnet with OS trixie [15:16:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS trixie [15:23:21] (CR) Daniel Kinzler: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: Pmiazga) [15:24:15] FIRING: [4x] SystemdUnitFailed: 
amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:13] (PS1) CDanis: discovery.wmnet: add gerrit alias [dns] - https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) [15:28:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [15:28:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1005.eqiad.wmnet with OS trixie [15:28:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11303431 (DMburugu) Yes, I approve this. [15:28:57] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2078.codfw.wmnet [15:29:23] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2078.codfw.wmnet [15:30:04] (PS11) Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [15:32:00] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host maps-test2002.codfw.wmnet with OS trixie [15:32:43] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2203'] [15:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:37] !log 
andrew@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS trixie [15:37:05] (CR) BCornwall: [C:+2] wikimedia.org: Add Figma domain verification [dns] - https://gerrit.wikimedia.org/r/1198191 (https://phabricator.wikimedia.org/T408003) (owner: BCornwall) [15:37:19] !log brett@dns1004 START - running authdns-update [15:38:03] !log brett@dns1004 END - running authdns-update [15:38:56] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1006.eqiad.wmnet with OS trixie [15:39:02] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11303487 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet with... [15:40:00] (PS1) Elukey: sre.hardware: improve Dell IDRAC's version pattern [cookbooks] - https://gerrit.wikimedia.org/r/1198355 (https://phabricator.wikimedia.org/T406964) [15:40:55] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2078.codfw.wmnet [15:40:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Add es2057 T402859', diff saved to https://phabricator.wikimedia.org/P84277 and previous config saved to /var/cache/conftool/dbconfig/20251023-154056-fceratto.json [15:41:03] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [15:41:05] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2057.codfw.wmnet [15:41:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2057.codfw.wmnet [15:41:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2078.codfw.wmnet [15:41:52] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: No disk boot option when moving 
ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11303503 (elukey) Ok so turned out that the aforementioned file was just a test, but `iDRAC-with-Lifecycle-Controller_Firmware_VP... [15:42:49] (CR) Jelto: [C:-1] "thank you for the change! one comment in line" [dns] - https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) (owner: CDanis) [15:43:19] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-worker2203'] [15:43:25] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2203'] [15:43:38] (CR) JHathaway: [C:+1] sre.hardware: improve Dell IDRAC's version pattern [cookbooks] - https://gerrit.wikimedia.org/r/1198355 (https://phabricator.wikimedia.org/T406964) (owner: Elukey) [15:44:07] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-worker2203'] [15:44:38] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11303514 (BCornwall) This has also caused our varnish test suite to fail as browser-detection.inc.vcl does not exist. 
[15:44:48] (CR) Daniel Kinzler: api-gateway: support per-route rate limit groups for rest gateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1192879 (owner: Daniel Kinzler) [15:45:41] (PS2) CDanis: discovery.wmnet: add gerrit alias [dns] - https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) [15:45:49] (CR) CDanis: discovery.wmnet: add gerrit alias (1 comment) [dns] - https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) (owner: CDanis) [15:46:05] (PS2) Bking: WIP: deploy a test OpenSearch cluster in opensearch-ipoid-test ns [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [15:46:55] (CR) Bking: WIP: deploy a test OpenSearch cluster in opensearch-ipoid-test ns (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: Bking) [15:48:33] (CR) Jelto: [C:+1] "lgtm now, also the additional step for the failover procedure is acceptable imho, the CDN migration is a bit more time critical at the mom" [dns] - https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) (owner: CDanis) [15:48:33] (CR) DCausse: [C:+1] "lgtm!" 
[alerts] - https://gerrit.wikimedia.org/r/1174723 (https://phabricator.wikimedia.org/T408026) (owner: Gmodena) [15:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:53:41] (PS1) Andrew Bogott: Revert "preseed: (TEMPORARY) switch maps-test hosts to raid 10" [puppet] - https://gerrit.wikimedia.org/r/1198358 [15:53:53] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [15:54:08] (PS2) Andrew Bogott: Revert "preseed: (TEMPORARY) switch maps-test hosts to raid 10" [puppet] - https://gerrit.wikimedia.org/r/1198358 (https://phabricator.wikimedia.org/T407586) [15:55:06] (PS1) Andrew Bogott: cloudcontrol2010-dev: switch from sw raid10 to raid5 [puppet] - https://gerrit.wikimedia.org/r/1198359 (https://phabricator.wikimedia.org/T407586) [15:55:16] (PS3) Andrew Bogott: Revert "preseed: (TEMPORARY) switch maps-test hosts to raid 10" [puppet] - https://gerrit.wikimedia.org/r/1198358 (https://phabricator.wikimedia.org/T407586) [15:55:17] (PS2) Andrew Bogott: cloudcontrol2010-dev: switch from sw raid10 to raid5 [puppet] - https://gerrit.wikimedia.org/r/1198359 (https://phabricator.wikimedia.org/T407586) [15:56:26] (CR) Filippo Giunchedi: [C:+1] Revert "preseed: (TEMPORARY) switch maps-test hosts to raid 10" [puppet] - https://gerrit.wikimedia.org/r/1198358 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:56:34] (CR) Filippo Giunchedi: [C:+1] cloudcontrol2010-dev: switch from sw raid10 to raid5 [puppet] - https://gerrit.wikimedia.org/r/1198359 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:58:31] !log andrew@cumin2002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [15:58:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2057 slowly with 10 steps - Pooling in new host [15:58:59] (CR) Andrew Bogott: [C:+2] Revert "preseed: (TEMPORARY) switch maps-test hosts to raid 10" [puppet] - https://gerrit.wikimedia.org/r/1198358 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:59:01] (CR) Andrew Bogott: [C:+2] cloudcontrol2010-dev: switch from sw raid10 to raid5 [puppet] - https://gerrit.wikimedia.org/r/1198359 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [15:59:19] SRE, collaboration-services, Infrastructure-Foundations, vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11303572 (CDanis) `tcp-proxy` sounds good to me as a name. 2vcpu/2Gi/20Gi also SGTM. I could maybe imagine going smaller, but whatever, it'... [16:00:05] jhathaway and moritzm: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:46] (PS3) Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [16:02:59] (CR) Clément Goubert: [C:+2] ~daniel/.screenrc: force login shell [puppet] - https://gerrit.wikimedia.org/r/1198305 (https://phabricator.wikimedia.org/T404739) (owner: Daniel Kinzler) [16:05:04] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173 [16:05:11] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [16:06:15] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host maps-test2002.codfw.wmnet with OS trixie [16:07:01] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS bookworm [16:11:11] (PS1) Andrew Bogott: Correct preseed entries for cloudcontrol10[8,9,10]-dev [puppet] - https://gerrit.wikimedia.org/r/1198363 (https://phabricator.wikimedia.org/T342455) [16:12:32] (CR) JHathaway: "Another option, as mentioned in `sysctl.d(5)` is to use a udev rule, I tested this one:" [puppet] - https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: JHathaway) [16:13:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [16:14:48] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:15:04] (CR) Andrew Bogott: [C:+2] Correct preseed entries for cloudcontrol10[8,9,10]-dev [puppet] - https://gerrit.wikimedia.org/r/1198363 (https://phabricator.wikimedia.org/T342455) (owner: Andrew Bogott) [16:16:08] (PS1) Clément Goubert: Revert "~daniel/.screenrc: force login shell" [puppet] - https://gerrit.wikimedia.org/r/1198367 [16:16:17] (CR) Clément Goubert: [V:+2 
C:+2] Revert "~daniel/.screenrc: force login shell" [puppet] - https://gerrit.wikimedia.org/r/1198367 (owner: Clément Goubert) [16:17:01] (CR) Bking: Add OpenSearch cluster configs for net-new clusters (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: Bking) [16:20:05] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [16:20:27] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [16:22:10] (PS3) Clare Ming: Add config for xLab MW Module experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) [16:23:49] (CR) Filippo Giunchedi: "Nice find! Yes I think that ought to work and cater for module unload too. And yes I think there shouldn't be too many modules." [puppet] - https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: JHathaway) [16:23:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:25:16] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:25:42] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:27:17] (PS1) BCornwall: varnish: Stub browser-detection for tests [labs/private] - https://gerrit.wikimedia.org/r/1198369 (https://phabricator.wikimedia.org/T404826) [16:28:10] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
maps-test2002.codfw.wmnet with reason: host reimage [16:30:34] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [16:31:00] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [16:33:41] (CR) JHathaway: "> Nice find! Yes I think that ought to work and cater for module unload too. And yes I think there shouldn't be too many modules." [puppet] - https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: JHathaway) [16:35:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [16:38:49] (CR) Elukey: [C:+2] sre.hardware: improve Dell IDRAC's version pattern [cookbooks] - https://gerrit.wikimedia.org/r/1198355 (https://phabricator.wikimedia.org/T406964) (owner: Elukey) [16:43:33] (PS1) Cathal Mooney: Include statements in reverse zones for new subnets [dns] - https://gerrit.wikimedia.org/r/1198370 (https://phabricator.wikimedia.org/T396063) [16:44:07] (CR) CI reject: [V:-1] Include statements in reverse zones for new subnets [dns] - https://gerrit.wikimedia.org/r/1198370 (https://phabricator.wikimedia.org/T396063) (owner: Cathal Mooney) [16:47:05] (PS1) Andrew Bogott: Revert "cloudcontrol2010-dev: switch from sw raid10 to raid5" [puppet] - https://gerrit.wikimedia.org/r/1198371 (https://phabricator.wikimedia.org/T407586) [16:48:19] Puppet, Beta-Cluster-Infrastructure: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#11303762 (Krinkle) Is something preventing this fix from applying to labs/private? * https://codesearch.wmcloud.... 
[16:48:31] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [16:51:55] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for 10.64.186.1 - cmooney@cumin1003" [16:51:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for 10.64.186.1 - cmooney@cumin1003" [16:51:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:52:08] (03PS1) 10DCausse: search: alert on index failures [alerts] - 10https://gerrit.wikimedia.org/r/1198372 (https://phabricator.wikimedia.org/T402629) [16:52:52] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11303781 (10Andrew) I can reproduce this on a second server (cloudcontrol1008-dev) which is also a config b R450. I could /not/ reproduce this on mapstest1002-dev... 
[16:52:52] (03PS2) 10Cathal Mooney: Include statements in reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/1198370 (https://phabricator.wikimedia.org/T396063) [16:53:23] (03PS1) 10Krinkle: puppetserver: Generalize git-rebase fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) [16:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:54:24] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11303790 (10Andrew) On my last attempt on 2010-dev (raid5) I'm intrigued by this out of memory error: ` Booting from Hard drive C: GRUB loading.. Welcome to GRUB... [16:55:03] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#11303795 (10bd808) >>! In T397877#11303762, @Krinkle wrote: > It seems this script is shared... 
[16:55:10] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R540 hardware - https://phabricator.wikimedia.org/T407586#11303796 (10Andrew) [16:56:03] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11303803 (10Andrew) [16:56:15] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11303807 (10Andrew) [16:57:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:57:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:58:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2002.codfw.wmnet with OS bookworm [17:00:05] bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1700) [17:00:24] (03PS4) 10Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [17:00:33] nothing to do in my deploy window today [17:03:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11303822 (10Raine) [17:05:46] 06SRE, 10DNS, 06Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T408003#11303825 (10JKelsoteel-WMF) Thank you @BCornwall - works on my end! 
[17:07:12] (03CR) 10Kamila Součková: admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [17:07:27] 06SRE, 10Hiddenparma, 06Traffic, 13Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11303839 (10Krinkle) >>! In T407966#11299911, @ssingh merged: > %%%[operations/puppet@production] varnish: add conditional to varnish::common::vcl for bet... [17:07:29] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11303841 (10Jclark-ctr) [17:07:45] (03PS4) 10Krinkle: varnish: Remove unreachable optin=beta code [puppet] - 10https://gerrit.wikimedia.org/r/1197730 (https://phabricator.wikimedia.org/T405931) [17:08:15] (03CR) 10Kamila Součková: [C:04-2] "DNM, waiting for approval" [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [17:08:58] (03CR) 10Kamila Součková: "Approved" [puppet] - 10https://gerrit.wikimedia.org/r/1198340 (https://phabricator.wikimedia.org/T408008) (owner: 10Kamila Součková) [17:10:31] (03PS1) 10Krinkle: [DNM] varnish: Temporary browser-detection.inc.vcl stub for PCC/VTC [puppet] - 10https://gerrit.wikimedia.org/r/1198375 [17:24:26] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11303873 (10Lars) Thanks, working on my end. 
[17:26:05] (03CR) 10BCornwall: [C:03+2] varnish: Remove unreachable optin=beta code [puppet] - 10https://gerrit.wikimedia.org/r/1197730 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [17:33:43] (03CR) 10BCornwall: [C:03+1] Include statements in reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/1198370 (https://phabricator.wikimedia.org/T396063) (owner: 10Cathal Mooney) [17:35:17] (03PS2) 10Krinkle: puppetserver: Generalize git-rebase fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) [17:36:15] (03PS2) 10Krinkle: [DNM] varnish: Temporary browser-detection.inc.vcl stub for PCC/VTC [puppet] - 10https://gerrit.wikimedia.org/r/1198375 [17:41:12] (03PS3) 10Krinkle: puppetserver: Generalize git-rebase fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) [17:42:31] (03Abandoned) 10Ebernhardson: SUP: upgrade Java 17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196470 (https://phabricator.wikimedia.org/T404417) (owner: 10Peter Fischer) [17:45:47] (03CR) 10Ryan Kemper: [C:03+1] "Tested and working; just going to merge (no blast radius)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198206 (https://phabricator.wikimedia.org/T408063) (owner: 10Ryan Kemper) [17:45:48] (03CR) 10Ryan Kemper: [C:03+2] wdqs: don't nuke data_loaded file for categ xfer [cookbooks] - 10https://gerrit.wikimedia.org/r/1198206 (https://phabricator.wikimedia.org/T408063) (owner: 10Ryan Kemper) [17:46:40] FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 4.801% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:46:42] andrew@cumin2002 reimage (PID 120690) is awaiting input [17:48:13] (03PS4) 10Krinkle: puppetserver: Generalize git-rebase 
fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) [17:51:40] RESOLVED: DiskSpace: Disk space ml-serve1012:9100:/ 4.756% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:52:15] andrew@cumin2002 reimage (PID 123669) is awaiting input [17:54:51] (03Merged) 10jenkins-bot: wdqs: don't nuke data_loaded file for categ xfer [cookbooks] - 10https://gerrit.wikimedia.org/r/1198206 (https://phabricator.wikimedia.org/T408063) (owner: 10Ryan Kemper) [17:55:40] FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 4.79% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [17:58:29] (03CR) 10Ebernhardson: [C:03+2] search: alert on index failures [alerts] - 10https://gerrit.wikimedia.org/r/1198372 (https://phabricator.wikimedia.org/T402629) (owner: 10DCausse) [17:59:58] (03Merged) 10jenkins-bot: search: alert on index failures [alerts] - 10https://gerrit.wikimedia.org/r/1198372 (https://phabricator.wikimedia.org/T402629) (owner: 10DCausse) [18:00:04] dancy and andre: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T1800). 
nyaa~ [18:01:26] (03CR) 10Scardenasmolinar: [C:03+1] Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [18:04:16] (03CR) 10BCornwall: [C:03+2] varnish: Enable enable_m_redir in esams and drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1197694 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [18:08:32] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304003 (10Dzahn) I took the liberty and added `tcp-proxy` to https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Hostna... [18:09:16] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11304006 (10BCornwall) 05Resolved→03Open Not resolved due to broken tests. [18:10:14] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy1001.eqiad.wmnet [18:10:16] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:15:06] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2057 slowly with 10 steps - Pooling in new host [18:15:25] (03PS1) 10Dzahn: site: add tcp-proxy node stanza, insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1198378 (https://phabricator.wikimedia.org/T408064) [18:15:47] (03CR) 10CI reject: [V:04-1] site: add tcp-proxy node stanza, insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1198378 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [18:15:55] dzahn@cumin1002 makevm (PID 1312578) is awaiting input [18:16:03] (03PS2) 10Dzahn: site: add tcp-proxy node stanza, insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1198378 (https://phabricator.wikimedia.org/T408064) [18:17:50] (03CR) 10Dzahn: "for now just to continue running the makevm cookbook" [puppet] - 
10https://gerrit.wikimedia.org/r/1198378 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [18:18:22] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11304050 (10VRiley-WMF) I have hooked up the cables and awaiting to see if they are up and active. [18:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:21:32] dzahn@cumin2002 makevm (PID 147124) is awaiting input [18:22:30] cdanis: if you still like tcp-proxy.. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198378 [18:22:59] (03CR) 10CDanis: [C:03+1] site: add tcp-proxy node stanza, insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1198378 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [18:23:05] alright;) [18:23:12] thanks! [18:23:46] yep, going ahead with this seemed like a good way to contribute something right now [18:24:10] (03CR) 10Dzahn: [C:03+2] site: add tcp-proxy node stanza, insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1198378 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [18:25:20] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy1001.eqiad.wmnet - dzahn@cumin1002" [18:25:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy1001.eqiad.wmnet - dzahn@cumin1002" [18:25:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:25] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache tcp-proxy1001.eqiad.wmnet on all recursors [18:25:28] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache 
(exit_code=0) tcp-proxy1001.eqiad.wmnet on all recursors [18:25:47] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [18:25:50] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:25:57] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy1001.eqiad.wmnet - dzahn@cumin1002" [18:26:02] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy1001.eqiad.wmnet - dzahn@cumin1002" [18:27:23] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host tcp-proxy1001.eqiad.wmnet with OS trixie [18:27:40] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304086 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host tcp-proxy100... 
[18:29:12] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [18:29:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [18:29:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:18] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [18:29:21] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [18:29:52] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [18:29:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [18:30:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304094 (10Dzahn) 05Open→03In progress Ready to create Ganeti VM tcp-proxy1001.eqiad.wmnet in the eqiad cluster on g... [18:30:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy2001.codfw.wmnet with OS trixie [18:30:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy2001.codfw.wmnet with OS... 
[18:32:10] (03PS1) 10Dzahn: installserver: add partman for tcp-proxy VMs, standard [puppet] - 10https://gerrit.wikimedia.org/r/1198380 (https://phabricator.wikimedia.org/T408064) [18:32:49] (03CR) 10Dzahn: [C:03+2] installserver: add partman for tcp-proxy VMs, standard [puppet] - 10https://gerrit.wikimedia.org/r/1198380 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [18:33:25] (03PS1) 10Bking: admin_ng (dse-k8s): watch more OpenSearch-related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198381 (https://phabricator.wikimedia.org/T357753) [18:33:46] (03CR) 10Dzahn: [V:03+2 C:03+2] installserver: add partman for tcp-proxy VMs, standard [puppet] - 10https://gerrit.wikimedia.org/r/1198380 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [18:35:17] (03CR) 10Ssingh: [C:03+1] "Looks good! Safe NOOP change but let's merge Monday. Nice job." [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [18:39:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11304147 (10Jhancock.wm) a:03Jhancock.wm got logged into the idrac and found these errors. ` The system board Pfault fail-safe voltage is outsi... [18:40:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11304158 (10Raine) Thanks @Jhancock.wm, appreciated! No worries, this is really not urgent. [18:43:05] (03CR) 10Bking: [C:03+2] admin_ng (dse-k8s): watch more OpenSearch-related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198381 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [18:43:20] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198381 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [18:50:27] (03Merged) 10jenkins-bot: admin_ng (dse-k8s): watch more OpenSearch-related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198381 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [18:53:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:53:09] (03PS7) 10Herron: profile::thanos::query::store_config: add define [puppet] - 10https://gerrit.wikimedia.org/r/1197669 (https://phabricator.wikimedia.org/T406054) [18:53:19] (03PS19) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [18:54:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:56:28] Pressing the train button now [18:56:54] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198382 (https://phabricator.wikimedia.org/T405680) [18:56:56] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198382 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [18:57:48] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198382 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [18:59:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:00] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.24 refs T405680 [19:06:04] T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680 [19:06:49] 
(03PS1) 10Scott French: deployment_server: Return production mediawiki releases to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1198383 (https://phabricator.wikimedia.org/T405955) [19:06:50] (03PS1) 10Scott French: Reenable enrollment in PHP 8.3 at 1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198384 (https://phabricator.wikimedia.org/T405955) [19:12:55] (03CR) 10JHathaway: [C:03+2] sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:13:04] dzahn@cumin1002 makevm (PID 1312578) is awaiting input [19:13:14] (03CR) 10JHathaway: [V:03+2 C:03+2] sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:13:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11304271 (10Raine) a:05DMburugu→03Raine >>! In T408008#11303431, @DMburugu wrote: > Yes, I approve this. Thanks! 
[19:20:44] (03CR) 10Herron: [C:03+2] profile::thanos::query::store_config: add define [puppet] - 10https://gerrit.wikimedia.org/r/1197669 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [19:21:57] dzahn@cumin2002 makevm (PID 147124) is awaiting input [19:24:15] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:26] (03PS1) 10Daniel Kinzler: api-gateway: make cookie name configurable for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) [19:25:05] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy2001.codfw.wmnet with OS trixie [19:25:06] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [19:25:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy2001.co... 
[19:25:46] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [19:25:49] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:26:30] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy1001.eqiad.wmnet with OS trixie [19:26:30] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy1001.eqiad.wmnet [19:26:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host tcp-proxy1001.eq... [19:27:05] (03PS1) 10Herron: profile::thanos::query::store_config: update config path [puppet] - 10https://gerrit.wikimedia.org/r/1198386 [19:27:12] (03PS1) 10Dzahn: admin: remove pre-yubikey ssh key for dzahn [puppet] - 10https://gerrit.wikimedia.org/r/1198387 (https://phabricator.wikimedia.org/T407917) [19:29:16] (03CR) 10Herron: [C:03+2] profile::thanos::query::store_config: update config path [puppet] - 10https://gerrit.wikimedia.org/r/1198386 (owner: 10Herron) [19:29:24] FIRING: SLOMetricAbsent: edit-check-pre-save-checks-ratio - https://slo.wikimedia.org/?search=edit-check-pre-save-checks-ratio - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:29:53] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:30:00] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [19:30:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [19:30:08] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:32:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:32:53] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all 
recursors [19:32:56] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [19:33:05] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [19:34:24] RESOLVED: SLOMetricAbsent: edit-check-pre-save-checks-ratio - https://slo.wikimedia.org/?search=edit-check-pre-save-checks-ratio - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:36:35] 06SRE, 10Hiddenparma, 06Traffic, 13Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11304332 (10ssingh) >>! In T404826#11303839, @Krinkle wrote: >>>! In T407966#11299911, @ssingh merged: >> %%%[operations/puppet@production] varnish: add c... [19:37:16] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host tcp-proxy1001.eqiad.wmnet with OS trixie [19:37:24] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host tcp-proxy1001.eqiad.wmnet with OS... 
[19:41:25] (03PS20) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [19:42:27] (03PS1) 10Krinkle: MentorDashboard,UserImpact: bump cache version and set proper keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198389 (https://phabricator.wikimedia.org/T407403) [19:43:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198389 (https://phabricator.wikimedia.org/T407403) (owner: 10Krinkle) [19:45:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11304363 (10RobH) p:05Triage→03Medium [19:46:42] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy1001.eqiad.wmnet with reason: host reimage [19:46:44] 06SRE, 10Hiddenparma, 06Traffic, 13Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11304372 (10bd808) [19:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:50:43] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy1001.eqiad.wmnet with reason: host reimage [19:54:35] (03PS21) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late 
backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T2000) [20:00:05] cjming, JSherman, and Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] I'm here and logged into spiderpig [20:00:34] (03CR) 10Herron: [C:03+2] thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [20:00:40] o/ [20:00:56] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [20:00:59] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:01:15] mind if i go first? then i can pass to next in queue - i think everyone in the window can self-deploy [20:01:22] works for me! [20:01:25] ty! [20:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [20:02:18] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy1001.eqiad.wmnet with OS trixie [20:02:33] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11304396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host tcp-proxy1001.eqiad.wmnet with OS trix... 
[20:02:56] (03Merged) 10jenkins-bot: Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [20:03:15] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1197344|Add config for xLab MW Module experiment (T401705)]] [20:03:20] T401705: Implement debugging for events in the Javascript SDK - https://phabricator.wikimedia.org/T401705 [20:03:34] (03PS3) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) [20:03:41] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:03:41] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [20:03:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:03:49] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:04:18] (03PS1) 10Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) [20:07:20] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:07:32] !log cjming@deploy2002 cjming: Backport for [[gerrit:1197344|Add config for xLab MW Module experiment (T401705)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:07:40] testing [20:07:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:07:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:07:52] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [20:07:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:08:04] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [20:08:31] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [20:08:33] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:10:27] !log cjming@deploy2002 cjming: Continuing with sync [20:10:37] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [20:10:40] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [20:10:40] (03CR) 10Dzahn: [C:03+2] admin: remove pre-yubikey ssh key for dzahn [puppet] - 10https://gerrit.wikimedia.org/r/1198387 (https://phabricator.wikimedia.org/T407917) (owner: 10Dzahn) [20:12:12] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:12:27] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [20:12:58] (03PS1) 10Cwhite: aptrepo: add component/curator5 to bullseye/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1198392 (https://phabricator.wikimedia.org/T407199) [20:13:20] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [20:13:22] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:14:30] !log 
andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [20:14:38] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197344|Add config for xLab MW Module experiment (T401705)]] (duration: 11m 23s) [20:14:42] T401705: Implement debugging for events in the Javascript SDK - https://phabricator.wikimedia.org/T401705 [20:14:51] JSherman: all yours [20:14:58] cjming: thanks! [20:15:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [20:16:12] (03Merged) 10jenkins-bot: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [20:16:28] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1192956|Set AutoModeratorMultiLingualRevertRisk with available wikis (T400727)]] [20:16:33] T400727: set AutoModeratorMultiLingualRevertRisk with available wikis - https://phabricator.wikimedia.org/T400727 [20:16:46] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:16:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:16:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:16:51] (03Abandoned) 10Cwhite: aptrepo: add component/curator5 to bullseye/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1198392 (https://phabricator.wikimedia.org/T407199) (owner: 10Cwhite) [20:16:52] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on 
all recursors [20:16:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:17:04] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:17:08] (03CR) 10Cwhite: [C:03+1] Update the apt components used for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [20:18:49] (03PS1) 10Dzahn: site: add tcp-proxy in all 7 DCs [puppet] - 10https://gerrit.wikimedia.org/r/1198393 (https://phabricator.wikimedia.org/T408064) [20:19:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: replace ssh keys with yubikey-backed key for Daniel Z - https://phabricator.wikimedia.org/T407917#11304447 (10Dzahn) 05Open→03Resolved self-removed my old key - tested it was removed on deploy1003 and I could still SSH to it afterwards [20:20:37] !log jsn@deploy2002 jsn, kgraessle: Backport for [[gerrit:1192956|Set AutoModeratorMultiLingualRevertRisk with available wikis (T400727)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:20:52] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11304451 (10RKemper) Met with rzl. #### Discussion highlights - Went over the general philosophical distincti... [20:20:59] testing [20:22:17] (03PS2) 10Dzahn: site: add tcp-proxy in all 7 DCs [puppet] - 10https://gerrit.wikimedia.org/r/1198393 (https://phabricator.wikimedia.org/T408064) [20:22:19] !log jsn@deploy2002 Sync cancelled. 
[20:22:49] dzahn@cumin2002 makevm (PID 171277) is awaiting input [20:23:40] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:23:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM tcp-proxy2001.codfw.wmnet - dzahn@cumin2002" [20:23:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:23:46] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [20:23:49] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [20:23:58] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy2001.codfw.wmnet [20:24:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:24:10] (03PS1) 10Jsn.sherman: Revert "Set AutoModeratorMultiLingualRevertRisk with available wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198394 [20:24:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198394 (owner: 10Jsn.sherman) [20:24:41] change unhappy in testing; reverting [20:25:19] (03Merged) 10jenkins-bot: Revert "Set AutoModeratorMultiLingualRevertRisk with available wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198394 (owner: 10Jsn.sherman) [20:25:37] !log jsn@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1198394|Revert "Set AutoModeratorMultiLingualRevertRisk with available wikis"]] [20:29:42] !log jsn@deploy2002 jsn: Backport for [[gerrit:1198394|Revert "Set AutoModeratorMultiLingualRevertRisk with available wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:30:19] !log jsn@deploy2002 jsn: Continuing with sync [20:30:24] (03PS2) 10Krinkle: MentorDashboard,UserImpact: bump cache version and set proper keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198389 (https://phabricator.wikimedia.org/T407403) [20:30:38] (03PS1) 10Krinkle: fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198395 (https://phabricator.wikimedia.org/T407403) [20:30:57] (03PS1) 10Krinkle: MentorDashboard,UserImpact: bump cache version and set proper keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1198396 (https://phabricator.wikimedia.org/T407403) [20:33:00] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [20:34:17] (03PS1) 10Dzahn: site/role: create placeholder role for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408064) [20:34:30] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198394|Revert "Set AutoModeratorMultiLingualRevertRisk with available wikis"]] (duration: 08m 53s) [20:34:38] Krinkle: All yours [20:34:47] thx [20:34:55] (03CR) 10CI reject: [V:04-1] site/role: create placeholder role for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408064) (owner: 10Dzahn) [20:36:32] (03PS2) 10Dzahn: site/role: create placeholder role for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 
(https://phabricator.wikimedia.org/T408064) [20:37:13] (CR) RLazarus: [C:+1] deployment_server: Return production mediawiki releases to 8.3 [puppet] - https://gerrit.wikimedia.org/r/1198383 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [20:37:39] (PS3) Krinkle: MentorDashboard,UserImpact: Bump cache and set proper keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198389 (https://phabricator.wikimedia.org/T407403) [20:37:40] (CR) RLazarus: [C:+1] Reenable enrollment in PHP 8.3 at 1% [mediawiki-config] - https://gerrit.wikimedia.org/r/1198384 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [20:38:00] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [20:38:04] (PS2) Krinkle: MentorDashboard,UserImpact: Bump cache and set proper keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1198396 (https://phabricator.wikimedia.org/T407403) [20:39:12] MichaelG_WMF: Is it okay to backport the cache bump in ~20min? [20:39:41] I want to respect the CR if it's not ready. It seems like the code is not at issue but want to make sure.
[20:42:39] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1198395 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle) [20:45:16] (Merged) jenkins-bot: fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1198395 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle) [20:45:39] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1198395|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] [20:45:44] T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403 [20:49:38] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1198395|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:50:43] SRE-SLO, SRE Observability (FY2025/2026-Q1): Thanos: support multiple ruler instances - https://phabricator.wikimedia.org/T406054#11304559 (herron) In progress→Resolved We now have two thanos rule instances running, "main" (the pre-existing instance) and a new instance called "pilot" Each in...
[20:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:56:36] !log krinkle@deploy2002 krinkle: Continuing with sync [20:58:30] SRE, Hiddenparma, Traffic, Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11304584 (bd808) > For browser detection, we've created a private gitlab repository to host some code we don't want to be publicly available, to avoid o... [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T2100) [21:00:44] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198395|fix(MentorDashboard): fix caching for PHP 8.1 -> 8.3 migration (T407403)]] (duration: 15m 06s) [21:00:49] T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403 [21:02:11] (CR) Scott French: "Thanks for the review! Given the hour, I'll hold off on deploying this until Monday.
However, I will go ahead and send subsequent patches " [puppet] - https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [21:02:32] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198389 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle) [21:02:32] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1198396 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle) [21:03:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:04:51] (Merged) jenkins-bot: MentorDashboard,UserImpact: Bump cache and set proper keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198389 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle) [21:07:03] (CR) Bking: [C:+1] "Post-review +1 FTW" [cookbooks] - https://gerrit.wikimedia.org/r/1198206 (https://phabricator.wikimedia.org/T408063) (owner: Ryan Kemper) [21:09:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:12:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:14:29] (Merged) jenkins-bot: MentorDashboard,UserImpact: Bump cache and set proper
keygroup [extensions/GrowthExperiments] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1198396 (https://phabricator.wikimedia.org/T407403) (owner: Krinkle) [21:19:01] SRE, Hiddenparma, Traffic, Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11304609 (CDanis) >>! In T404826#11304584, @bd808 wrote: >> For browser detection, we've created a private gitlab repository to host some code we don't... [21:19:07] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [21:20:44] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:20:49] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [21:29:01] (PS11) Scott French: P:cache::haproxy: move x_requestctl setup into listen section [puppet] - https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) [21:29:01] (CR) Scott French: "Thanks in advance for the review, Fabrizio!"
[puppet] - https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [21:30:34] (PS1) Ryan Kemper: wdqs.data-transfer: make --force behavior default [cookbooks] - https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) [21:31:27] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1198389|MentorDashboard,UserImpact: Bump cache and set proper keygroup (T407403)]], [[gerrit:1198396|MentorDashboard,UserImpact: Bump cache and set proper keygroup (T407403)]] [21:31:32] T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403 [21:31:49] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:31:51] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:31:54] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [21:32:07] (CR) Bking: [C:+1] wdqs.data-transfer: make --force behavior default [cookbooks] - https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) (owner: Ryan Kemper) [21:33:30] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1198389|MentorDashboard,UserImpact: Bump cache and set proper keygroup (T407403)]], [[gerrit:1198396|MentorDashboard,UserImpact: Bump cache and set proper keygroup (T407403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:37:44] (CR) CI reject: [V:-1] wdqs.data-transfer: make --force behavior default [cookbooks] - https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) (owner: Ryan Kemper) [21:37:55] !log krinkle@deploy2002 krinkle: Continuing with sync [21:39:21] (PS2) Ryan Kemper: wdqs.data-transfer: make --force behavior default [cookbooks] - https://gerrit.wikimedia.org/r/1198399 (https://phabricator.wikimedia.org/T408163) [21:41:49] SRE, SRE-Access-Requests: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164 (JMoore-WMF) NEW [21:42:01] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198389|MentorDashboard,UserImpact: Bump cache and set proper keygroup (T407403)]], [[gerrit:1198396|MentorDashboard,UserImpact: Bump cache and set proper keygroup (T407403)]] (duration: 10m 34s) [21:42:06] T407403: Error: Invalid serialization data for DatePeriod object - https://phabricator.wikimedia.org/T407403 [21:43:03] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:43:07] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [21:44:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS trixie [21:45:05] (CR) Andrew Bogott: [C:+2] Revert "cloudcontrol2010-dev: switch from sw raid10 to raid5" [puppet] - https://gerrit.wikimedia.org/r/1198371 (https://phabricator.wikimedia.org/T407586) (owner: Andrew Bogott) [21:45:42] SRE, SRE-Access-Requests: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11304674 (MMiller_WMF) I am Justin's
manager and I approve these requests. [21:49:22] Krinkle: once the dust settles and things look good, give me a heads up and I can get started moving some traffic back onto 8.3. [21:50:07] jouncebot: nowandnext [21:50:07] For the next 0 hour(s) and 9 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251023T2100) [21:50:07] In 8 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251024T0600) [21:52:51] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet, repooling both afterwards [21:52:56] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [21:54:25] swfrench-wmf: LGTM. Feel free to :) [21:55:16] Krinkle: awesome, thank you! [21:55:55] FIRING: DiskSpace: Disk space ml-serve1012:9100:/ 2.138% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ml-serve1012 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [21:56:40] (CR) Scott French: [C:+2] deployment_server: Return production mediawiki releases to 8.3 [puppet] - https://gerrit.wikimedia.org/r/1198383 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [22:00:11] FYI, I have a puppet-agent run ongoing on deploy2002.
once that completes (ETA 5m), there will be two scap deployments in succession: a sync-world that will take 10-20m, followed by a backport that should be faster [22:00:50] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [22:04:13] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:04:21] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet, repooling both afterwards [22:04:24] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1013.eqiad.wmnet, repooling both afterwards [22:04:26] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [22:04:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [22:05:13] !log swfrench@deploy2002 Started scap sync-world: Return next/migration releases to 8.3 - T405955 [22:05:18] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [22:10:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:10:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:01] !log restart apache2 on gerrit1003 [22:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:05] !log swfrench@deploy2002 Finished scap sync-world: Return next/migration releases to 8.3 - T405955 (duration: 09m 52s) [22:15:10] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [22:15:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:15:50] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1013.eqiad.wmnet, repooling both afterwards [22:15:52] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1014.eqiad.wmnet, repooling both afterwards [22:15:54] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [22:16:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198384 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:17:35] (03Merged) 10jenkins-bot: Reenable enrollment in PHP 8.3 at 1% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198384 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:17:54] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1198384|Reenable enrollment in PHP 8.3 at 1% (T405955)]] [22:19:15] 
FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:22:05] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1198384|Reenable enrollment in PHP 8.3 at 1% (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:22:10] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [22:23:32] !log swfrench@deploy2002 swfrench: Continuing with sync [22:23:55] FIRING: [2x] SystemdUnitFailed: wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:22] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1014.eqiad.wmnet, repooling both afterwards [22:27:24] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1015.eqiad.wmnet, repooling both afterwards [22:27:27] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [22:27:33] (03PS1) 10Clare Ming: ext.xLab: Implement UnenrolledExperiment#setStream() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198404 [22:27:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198404 (owner: 10Clare Ming) [22:28:55] FIRING: [3x] 
SystemdUnitFailed: wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:00] (03PS1) 10Aaron Schulz: Move rest_v1-wikimedia.json under the wwwportal directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198405 (https://phabricator.wikimedia.org/T396805) [22:30:04] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198384|Reenable enrollment in PHP 8.3 at 1% (T405955)]] (duration: 12m 10s) [22:30:10] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [22:34:29] (03PS1) 10Aaron Schulz: restgateway: update spec-json-wikimedia to use www prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198406 (https://phabricator.wikimedia.org/T396805) [22:35:51] (03PS3) 10Andrea Denisse: alertmanager: Add support for team mentions on the Slack template [puppet] - 10https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) [22:35:51] (03CR) 10Andrea Denisse: "Hi folks, I tested this in the #api-alerts-test channel with the @alerts-api-mw-rest-test subteam. The patch contains the prod values." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) (owner: 10Andrea Denisse) [22:38:39] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1015.eqiad.wmnet, repooling both afterwards [22:38:42] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling both afterwards [22:38:44] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [22:38:55] FIRING: [4x] SystemdUnitFailed: wdqs-categories.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:22] (03CR) 10Andrea Denisse: "I also updated the Wikitech documentation to use it: https://wikitech.wikimedia.org/wiki/Alertmanager#Sending_alerts_to_Slack" [puppet] - 10https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) (owner: 10Andrea Denisse) [22:47:55] (03PS1) 10Cwhite: hiera: block more china unicom and telecom abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198408 (https://phabricator.wikimedia.org/T406774) [22:48:23] (03PS2) 10Cwhite: hiera: block more china unicom and telecom abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198408 (https://phabricator.wikimedia.org/T406774) [22:49:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host cloudvirt2004-dev.codfw.wmnet with OS trixie [22:50:09] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1016.eqiad.wmnet, repooling both afterwards [22:50:09] (03CR) 10Cwhite: [C:03+2] hiera: block more china unicom and telecom abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198408 (https://phabricator.wikimedia.org/T406774) (owner: 10Cwhite) [22:50:12] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling both afterwards [22:50:14] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [22:50:40] FIRING: [4x] SystemdUnitFailed: wdqs-categories.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:21] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1017.eqiad.wmnet, repooling both afterwards [23:01:23] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1018.eqiad.wmnet, repooling both afterwards [23:01:26] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [23:03:55] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:07:32] (03PS11) 
10Krinkle: varnish: Enable enable_m_redir everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1197695 (https://phabricator.wikimedia.org/T405931) [23:11:59] (03PS1) 10Cwhite: hiera: block more china unicom and telecom abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198411 (https://phabricator.wikimedia.org/T406774) [23:12:37] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1018.eqiad.wmnet, repooling both afterwards [23:12:39] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1019.eqiad.wmnet, repooling both afterwards [23:12:41] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [23:13:00] (03CR) 10Cwhite: [C:03+2] hiera: block more china unicom and telecom abusers [puppet] - 10https://gerrit.wikimedia.org/r/1198411 (https://phabricator.wikimedia.org/T406774) (owner: 10Cwhite) [23:13:55] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:03] (03PS1) 10Krinkle: wmf-config: Stop sending HTTP purges for mobile domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198412 (https://phabricator.wikimedia.org/T405931) [23:15:13] (03PS3) 10Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408064) [23:22:14] (03PS1) 10Clare Ming: ext.xLab: Implement OverriddenExperiment#setStream() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198413 [23:22:28] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198413 (owner: Clare Ming) [23:23:32] (CR) BCornwall: [C:+2] varnish: Enable enable_m_redir everywhere [puppet] - https://gerrit.wikimedia.org/r/1197695 (https://phabricator.wikimedia.org/T405931) (owner: Krinkle) [23:23:48] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1019.eqiad.wmnet, repooling both afterwards [23:23:51] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet, repooling both afterwards [23:23:53] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [23:23:55] FIRING: [4x] SystemdUnitFailed: wdqs-categories.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:16] FIRING: [4x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:05] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet, repooling both afterwards [23:35:07] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1021.eqiad.wmnet, repooling both
afterwards [23:35:10] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [23:35:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:22] (CR) Santiago Faci: [C:+1] ext.xLab: Implement UnenrolledExperiment#setStream() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198404 (owner: Clare Ming) [23:37:30] (CR) Santiago Faci: [C:+1] ext.xLab: Implement OverriddenExperiment#setStream() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198413 (owner: Clare Ming) [23:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1198414 [23:38:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1198414 (owner: TrainBranchBot) [23:41:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198413 (owner: Clare Ming) [23:42:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198404 (owner: Clare Ming) [23:46:00] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet ->
wdqs1021.eqiad.wmnet, repooling both afterwards [23:46:03] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling both afterwards [23:46:05] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [23:48:55] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:49:16] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:52:19] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1198414 (owner: TrainBranchBot) [23:57:07] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T406920, Update outdated categories info) xfer categories from wdqs1011.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling both afterwards [23:57:12] T406920: deepcategory search fails to show all expected results - https://phabricator.wikimedia.org/T406920 [23:58:55] FIRING: [4x] SystemdUnitFailed: wdqs-categories.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed