[00:06:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:07:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 3.394 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:14:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:15:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 4.482 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:18:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:20:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.257 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:23:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:25:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30035 bytes in 3.076 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:28:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:33:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: 
HTTP/1.1 200 OK - 30031 bytes in 5.270 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:38:49] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204709 [00:38:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204709 (owner: TrainBranchBot) [00:39:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:40:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 2.832 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:51:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:54:23] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:54:51] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204709 (owner: TrainBranchBot) [00:57:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:58:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.177 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:00:42] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:08:33] 
(PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 [01:08:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot) [01:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:09:39] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.006e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [01:11:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:13:47] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 04s) [01:15:23] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:33:23] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:42:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:01:39] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [02:09:28] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot) [02:09:47] (CR) Zabe: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot) [02:20:13] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:21:13] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:39:03] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot) [02:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:54:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [02:57:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.150 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:00:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:03:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.759 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:06:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:08:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.691 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:12:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:13:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 9.524 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:26:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:32:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:32:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30045 bytes in 9.464 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:33:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:34:07] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:34:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:35:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:36:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30045 bytes in 5.856 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:37:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:40:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:47:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.595 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:01:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:06:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:32:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [04:38:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.571 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:42:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [04:47:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 8.602 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:50:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:02:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 9.679 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:05:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:05:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:05:53] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:07:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:08:23] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:19:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 1.737 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:32:16] (PS3) Krinkle: Fix symbolic links [extensions/WikimediaMaintenance] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203281 [05:32:22] (Abandoned) Krinkle: Fix symbolic links [extensions/WikimediaMaintenance] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203281 (owner: Krinkle) [05:33:23] FIRING: 
[3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:42:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 2.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:42:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:57:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.652 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:00:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:02:04] Deploying MinT/machinetranslation. 
[06:02:21] (CR) KartikMistry: [C:+2] machinetranslation: Increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: KartikMistry) [06:04:12] (Merged) jenkins-bot: machinetranslation: Increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: KartikMistry) [06:06:48] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:08:34] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:11:31] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 5.801 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:14:24] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:14:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:17:22] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:18:33] !log machinetranslation: Increase replicas (T386371) [06:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:37] T386371: Request capacity increase in preparation for MinT for wiki Readers experiment - https://phabricator.wikimedia.org/T386371 [06:20:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 7.610 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:23:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:26:25] ops-eqiad, SRE, DBA, 
DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11369368 (Marostegui) @Jclark-ctr did the DIMM arrive in the end? Thanks! [06:27:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:27:53] ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11369379 (Jclark-ctr) @marostegui it arrived yesterday afternoon will be replacing first thing this morning [06:28:47] ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11369380 (Marostegui) Great news, thank you! [06:35:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T409818 [06:35:15] T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818 [06:36:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 1.738 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:36:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2161 with weight 0 T409818', diff saved to https://phabricator.wikimedia.org/P85291 and previous config saved to /var/cache/conftool/dbconfig/20251113-063651-fceratto.json [06:38:23] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:40:35] PROBLEM - Wikitech-static 
main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:43:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 5.535 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:45:25] (CR) Federico Ceratto: [C:+2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1203790 (https://phabricator.wikimedia.org/T409818) (owner: Gerrit maintenance bot) [06:47:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:47:42] !log Starting s8 codfw failover from db2165 to db2161 - T409818 [06:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:46] T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818 [06:49:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s8 codfw as read-only for maintenance - T409818', diff saved to https://phabricator.wikimedia.org/P85292 and previous config saved to /var/cache/conftool/dbconfig/20251113-064929-fceratto.json [06:49:31] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.012 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:53:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static 
[06:53:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2161 to s8 primary and set section read-write T409818', diff saved to https://phabricator.wikimedia.org/P85293 and previous config saved to /var/cache/conftool/dbconfig/20251113-065342-fceratto.json [06:53:46] T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818 [06:54:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:56:26] (CR) Federico Ceratto: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1203791 (https://phabricator.wikimedia.org/T409818) (owner: Gerrit maintenance bot) [06:57:28] !log fceratto@dns1004 START - running authdns-update [06:58:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 1.634 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:58:30] !log fceratto@dns1004 END - running authdns-update [06:59:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:59:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db2165 T409818', diff saved to https://phabricator.wikimedia.org/P85294 and previous config saved to /var/cache/conftool/dbconfig/20251113-065957-fceratto.json [07:00:02] T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818 [07:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0700) [07:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0700). [07:02:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:02:15] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2165 - Upgrading db2165.codfw.wmnet [07:02:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2165 - Upgrading db2165.codfw.wmnet [07:05:04] (PS1) Federico Ceratto: db2165: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1204746 (https://phabricator.wikimedia.org/T406008) [07:05:22] fceratto@cumin1003 major-upgrade (PID 2865916) is awaiting input [07:06:00] (CR) Marostegui: [C:+1] db2165: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1204746 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto) [07:06:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:08:02] (CR) Federico Ceratto: [C:+2] db2165: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1204746 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto) [07:10:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:10:53] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:18:32] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=x3 [07:18:38] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=x3 [07:18:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1022].eqiad.wmnet with reason: Cloning clouddb1022:s3 [07:19:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 9.473 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:19:42] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=x3 [07:19:45] (CR) Filippo Giunchedi: "> If we add a dependency on a puppetdb it means we can't have a test setup in cloud unless we build and maintain our own local puppetdb in" [puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb) [07:19:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1022].eqiad.wmnet with reason: Cloning clouddb1022:s3 [07:20:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2165 gradually with 4 steps - Migration of db2165.codfw.wmnet completed [07:22:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:23:10] (CR) Effie Mouzeli: [C:+1] mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [07:23:23] FIRING: [2x] 
JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:23:34] (CR) Effie Mouzeli: [C:+1] mw-(api-ext|web): return next to "idle" size [deployment-charts] - https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [07:23:56] (PS1) Marostegui: check_private_data_report: Add new hosts [puppet] - https://gerrit.wikimedia.org/r/1204749 (https://phabricator.wikimedia.org/T409557) [07:24:07] (CR) Effie Mouzeli: [C:+1] rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [07:24:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 9.865 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:24:46] (CR) Marostegui: [C:+2] check_private_data_report: Add new hosts [puppet] - https://gerrit.wikimedia.org/r/1204749 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui) [07:25:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:10] (CR) Effie Mouzeli: [C:+1] trafficserver: disable PHP_ENGINE next routing [puppet] - https://gerrit.wikimedia.org/r/1203569 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [07:27:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:30:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:30:48] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:34] (PS2) Giuseppe Lavagetto: cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545) [07:35:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:40:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:41:38] (CR) Giuseppe Lavagetto: [C:+2] cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto) [07:45:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:45:48] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:50:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:50:48] FIRING: [14x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:55:43] FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:43] RESOLVED: [7x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:05:32] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2165 gradually with 4 steps - Migration of db2165.codfw.wmnet completed [08:05:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [08:06:53] (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1204609 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:08:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [08:10:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [08:14:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [08:15:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [08:20:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11369646 (10MoritzMuehlenhoff) @RobH I've drained the two next hosts: ganeti1027 and ganeti1034 can be migrated next. When these are done and you... 
[08:20:22] (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from alertmanager access [puppet] - 10https://gerrit.wikimedia.org/r/1204620 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:25:33] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update pwstore docs to point to cumin1003 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204375 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:28:02] (03PS1) 10Muehlenhoff: Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380) [08:29:28] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 3.097 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:34:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:36:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30040 bytes in 0.852 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:40:38] (03PS1) 10Muehlenhoff: Remove cumin1002 from mysql root list [puppet] - 10https://gerrit.wikimedia.org/r/1204799 (https://phabricator.wikimedia.org/T389380) [08:44:18] (03CR) 10Marostegui: [C:03+1] Remove cumin1002 from mysql root list [puppet] - 10https://gerrit.wikimedia.org/r/1204799 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:44:25] (03PS1) 10Muehlenhoff: Bump changelog for 1.0.4 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204800 [08:52:47] (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from mysql root list [puppet] - 10https://gerrit.wikimedia.org/r/1204799 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:53:08] (03CR) 10Matthias Mullie: Reduce number of bucketsizes for 
MediaViewer (group0) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [08:57:23] (03CR) 10David Caro: [C:03+2] maintain-dbusers: add stat for last run [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [08:58:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:59:20] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [08:59:41] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet [09:00:05] andre and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0900) [09:01:19] (03CR) 10David Caro: [C:03+2] maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [09:01:28] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [09:01:45] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet [09:02:59] (03Merged) 10jenkins-bot: maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [09:04:37] jmm@cumin2002 netbox (PID 264227) is awaiting input [09:05:49] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:08:48] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204803 (https://phabricator.wikimedia.org/T408272) [09:08:50] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204803 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [09:09:07] FIRING: [2x] 
PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:44] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204803 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [09:13:59] (03PS1) 10Effie Mouzeli: prometheus: add recording rules for phpfpm_workers:active_percent [puppet] - 10https://gerrit.wikimedia.org/r/1204804 [09:19:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1204639 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking) [09:20:11] Help wanted: Train deployment to group2 fails with an issue in the docker-registry [09:20:23] 09:10:47 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error [09:20:24] subprocess.CalledProcessError: Command '['sudo', '/usr/local/bin/docker-pusher', '-q', 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-11-13-090954-publish-83']' returned non-zero exit status 1. [09:37:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11369925 (10cmooney) @papaul I'm really getting sick of Juniper on this one. Personally I suspect the input voltage/frequency (i.e. our feed... 
[09:39:41] (03CR) 10Jaime Nuche: "hi there, I was the relenger asked about this:" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [09:42:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:46:19] (03CR) 10Jaime Nuche: "I'm thinking now that I didn't read the commit message right. It seems this change is only about changing the location for uploads and I'm" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [09:46:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:49:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:50:48] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [09:50:50] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [09:54:31] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1001.eqiad.wmnet - fceratto@cumin1003" [09:54:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1001.eqiad.wmnet - fceratto@cumin1003" [09:54:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:59] !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test1001.eqiad.wmnet on all recursors [09:55:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1001.eqiad.wmnet on all recursors [09:55:35] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM db-test1001.eqiad.wmnet - fceratto@cumin1003" [09:55:39] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1001.eqiad.wmnet - fceratto@cumin1003" [09:56:57] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test1001.eqiad.wmnet with OS trixie [10:00:18] (03PS1) 10Marostegui: db1264: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1204807 (https://phabricator.wikimedia.org/T407941) [10:00:29] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog for 1.0.4 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204800 (owner: 10Muehlenhoff) [10:01:50] (03CR) 10Marostegui: "Host green in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1204807 (https://phabricator.wikimedia.org/T407941) (owner: 10Marostegui) [10:01:51] (03CR) 10Marostegui: [C:03+2] db1264: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1204807 (https://phabricator.wikimedia.org/T407941) (owner: 10Marostegui) [10:03:49] (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on all eqiad1 cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1204623 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [10:03:56] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.2 refs T408272 [10:04:00] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [10:07:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1264 slowly with 10 steps - Pooling for the first time [10:07:45] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1001.eqiad.wmnet with reason: host reimage [10:08:47] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1264 slowly with 10 steps - Pooling for the first time [10:08:57] !log marostegui@cumin1003 START - 
Cookbook sre.mysql.pool db1264 slowly with 10 steps - Pooling for the first time [10:10:15] (03PS1) 10Marostegui: installserver: Do not format es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1204809 [10:11:21] (03PS2) 10Muehlenhoff: Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380) [10:12:25] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1204809 (owner: 10Marostegui) [10:12:49] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1001.eqiad.wmnet with reason: host reimage [10:15:47] (03PS1) 10David Caro: maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 [10:16:42] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 [10:16:51] (03PS2) 10David Caro: maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 [10:22:04] (03PS1) 10Marostegui: mariadb: Productionize clouddb1023 [puppet] - 10https://gerrit.wikimedia.org/r/1204812 (https://phabricator.wikimedia.org/T409557) [10:23:58] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [10:25:10] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:32] (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 as Homer git peer [puppet] - 10https://gerrit.wikimedia.org/r/1204622 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:26:50] (03CR) 10Brouberol: [C:03+2] deployment_server: migrate mediawiki-dumps-legacy to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1203578 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [10:29:00] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 64.58 ms [10:29:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host 
sretest1003.eqiad.wmnet with OS bookworm [10:30:12] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 64.39 ms [10:31:52] (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on eqiad1 cloudnets [puppet] - 10https://gerrit.wikimedia.org/r/1204624 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [10:33:35] (03PS1) 10Mvolz: Remove deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204813 (https://phabricator.wikimedia.org/T361576) [10:33:44] (03CR) 10CI reject: [V:04-1] Remove deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204813 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [10:34:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1001.eqiad.wmnet with OS trixie [10:34:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1001.eqiad.wmnet [10:37:53] (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on remaining eqiad1 nodes [puppet] - 10https://gerrit.wikimedia.org/r/1204625 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [10:38:49] (03CR) 10Brouberol: [C:03+1] Enable an oauth2-proxy for growthbook frontend and api pods (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [10:41:50] (03PS1) 10Majavah: P:openstack: neutron: Set MTU setting to 9000 in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) [10:44:30] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloud_private_subnet: Cleanup feature flag for jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [10:44:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 12 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" 
[puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [10:44:51] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloud_private_subnet: Cleanup feature flag for jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [10:45:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [10:46:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [10:46:44] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:48:14] (03PS6) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 [10:49:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db-test1001.eqiad.wmnet with reason: Cloning [10:49:54] !log installing libfcgi security updates [10:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [10:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:52:56] (03CR) 10Filippo Giunchedi: [C:03+1] P:openstack: neutron: Set MTU setting to 9000 in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [10:53:12] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: neutron: Set MTU setting 
to 9000 in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [10:53:53] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7607/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [10:55:25] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019 (10Vgutierrez) 03NEW [10:55:35] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370189 (10Vgutierrez) p:05Triage→03High [10:56:14] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic, 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370190 (10Vgutierrez) [10:56:58] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 64.50 ms [10:57:18] 06SRE, 06Data-Platform-SRE, 06Traffic, 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370193 (10Vgutierrez) [10:57:39] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020 (10jcrespo) 03NEW [10:57:41] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic, 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370205 (10Vgutierrez) [10:58:03] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: 
Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11370206 (10jcrespo) [10:58:58] (03PS2) 10Tchanders: Freeze LiquidThreads on enwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) [10:59:13] !log upgrade Envoy on idm* T405808 [10:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:18] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [10:59:22] (03CR) 10Tchanders: "We have the go-ahead from comms" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1100) [11:00:12] (03CR) 10Filippo Giunchedi: [C:03+1] maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 (owner: 10David Caro) [11:01:49] (03CR) 10David Caro: [C:03+2] maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 (owner: 10David Caro) [11:02:59] (03Merged) 10jenkins-bot: maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 (owner: 10David Caro) [11:03:23] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11370232 (10MGerlach) [11:03:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11370233 (10MGerlach) [11:07:20] (03PS1) 10Muehlenhoff: Setup cumin1002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1204818 (https://phabricator.wikimedia.org/T389380) [11:08:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm [11:11:33] 
FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:20:03] (03CR) 10Muehlenhoff: [C:03+2] Setup cumin1002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1204818 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:20:56] (03PS1) 10David Caro: maintain_dbusers: initialize the stats [puppet] - 10https://gerrit.wikimedia.org/r/1204820 [11:21:27] (03PS1) 10Marco Fossati: ImageBrowsing: add tier 2 experiment [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) [11:22:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati) [11:22:46] (03PS1) 10Marco Fossati: xLab: add tier 2 experiment to ImageBrowsing [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) [11:23:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati) [11:23:23] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:25:37] (03CR) 10Muehlenhoff: [C:03+1] "This can be merged now, cumin1002 has been moved to the insetup role for eventual decom" [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui) [11:26:00] (03CR) 10Marostegui: [C:03+2] wmf_root_client.pp: Remove cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui) [11:26:49] (03CR) 10David Caro: [V:03+1] "Tested in cloudcontrol1007:" [puppet] - 10https://gerrit.wikimedia.org/r/1204820 (owner: 10David Caro) [11:28:30] (03PS1) 10David Caro: maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827 [11:28:30] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:29:11] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:30:27] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:30:47] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:30:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders) [11:32:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:32:22] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:35:33] (03PS2) 10Muehlenhoff: Enable nftables on cluster::management on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380) [11:36:36] (03PS1) 10Kosta Harlan: hCaptcha: Update config for addurl trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204830 (https://phabricator.wikimedia.org/T409957) [11:38:23] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:41:17] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board), 07Essential-Work: Record api-user-agent in metrics; filter by MediaWikiJs - https://phabricator.wikimedia.org/T402385#11370307 (10Mvolz) 05Open→03Resolved [11:44:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:45:13] (03PS1) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832 [11:45:19] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:45:38] (03CR) 10FNegri: maintain_dbusers: fix MaintainDBUsersManyErrors expression (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204827 (owner: 10David Caro) [11:45:59] (03PS2) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 
10https://gerrit.wikimedia.org/r/1204832 [11:46:38] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11370320 (10Mvolz) The dashboard now has this: https://grafana.wikimedia.org/d/NJkCVermz/c... [11:46:39] (03PS3) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832 [11:47:16] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:48:23] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:53:46] (03PS3) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) [11:55:09] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:59:09] (03PS2) 10David Caro: maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827 [11:59:16] (03CR) 10David Caro: maintain_dbusers: fix MaintainDBUsersManyErrors expression (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204827 (owner: 10David Caro) [12:00:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:06:09] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Disable limits [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1204619 (owner: 10Clément Goubert) [12:07:09] !log uploaded wmf-laptop 1.0.4 to apt.wikimedia.org [12:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:10] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11370467 (10jcrespo) 🤨 {F70174897} [12:32:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204849 [12:33:21] !log installing bind security updates (client-side tools/libs only) [12:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:47] (03CR) 10Ladsgroup: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [12:35:48] (03CR) 10Muehlenhoff: [C:03+2] Fix alias [puppet] - 10https://gerrit.wikimedia.org/r/1204844 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:37:36] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11370496 (10jcrespo) Garage also doesn't support TLS/HTTPS be default, it requires a reverse proxy: https://garagehq... 
[12:39:38] (03PS1) 10Clément Goubert: rest-gateway: Set ratelimit key_prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 [12:42:40] (03CR) 10Daniel Kinzler: [C:03+1] "yes, please" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 (owner: 10Clément Goubert) [12:45:29] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Set ratelimit key_prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 (owner: 10Clément Goubert) [12:47:26] (03Merged) 10jenkins-bot: rest-gateway: Set ratelimit key_prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 (owner: 10Clément Goubert) [12:48:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:48:43] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:49:51] (03PS2) 10Effie Mouzeli: prometheus: add temp recording rules for phpfpm_workers:active_percent [puppet] - 10https://gerrit.wikimedia.org/r/1204804 [12:50:30] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists bawiktionary; drop database if exists chwikimedia; drop database if exists closed_zh_twwiki; drop database if exists comcomwiki; (T297297) [12:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:34] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [12:50:45] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:50:55] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:50:58] (03CR) 10David Caro: [C:03+2] maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827 (owner: 10David Caro) [12:52:10] (03Merged) 10jenkins-bot: maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827 (owner: 10David Caro) [12:55:31] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on 
GitLab host gitlab1003.wikimedia.org with reason: Security Upgrade [12:57:12] (03CR) 10Kamila Součková: [C:03+2] hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis) [12:57:53] (03CR) 10Kamila Součková: [C:03+2] hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1300) [13:01:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:01:05] !log installing amd64-microcode security updates [13:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:43] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Upgrade [13:06:16] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:07:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:24] (03PS1) 10Kamila Součková: hcaptcha proxy: add missing ; in nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1204863 [13:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:10:41] (03CR) 10JMeybohm: [C:03+1] hcaptcha proxy: add missing ; in nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1204863 (owner: 10Kamila Součková) [13:11:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11370609 (10Jclark-ctr) [13:11:10] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409980#11370611 (10Jclark-ctr) →14Duplicate dup:03T409938 [13:11:12] (03CR) 10Kamila Součková: [C:03+2] hcaptcha proxy: add missing ; in nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1204863 (owner: 10Kamila Součková) [13:12:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11370618 (10Jclark-ctr) [13:12:17] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409967#11370620 (10Jclark-ctr) →14Duplicate dup:03T409938 [13:12:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:35] (03PS1) 10Brouberol: airflow: release new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204864 (https://phabricator.wikimedia.org/T408711) [13:16:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:17:52] (03CR) 10Btullis: [C:03+1] airflow: release new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204864 (https://phabricator.wikimedia.org/T408711) (owner: 10Brouberol) [13:19:11] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [13:19:50] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Upgrade [13:21:08] (03PS1) 10Clément Goubert: rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) [13:23:49] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [13:24:31] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [13:25:25] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204849 (owner: 10PipelineBot) [13:25:48] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204666 (owner: 10PipelineBot) [13:26:21] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203434 (owner: 10PipelineBot) [13:26:46] (03CR) 
10Ladsgroup: [C:03+1] mariadb: Productionize clouddb1023 [puppet] - 10https://gerrit.wikimedia.org/r/1204812 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:26:54] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize clouddb1023 [puppet] - 10https://gerrit.wikimedia.org/r/1204812 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [13:27:09] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204849 (owner: 10PipelineBot) [13:28:43] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:29:04] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Upgrade [13:29:10] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:29:22] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:30:09] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:30:34] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [13:31:12] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:31:39] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:31:57] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:32:11] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:32:29] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:33:27] (03PS1) 10KartikMistry: Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) [13:35:16] PROBLEM - Kafka broker TLS certificate validity on kafka-main1006 is CRITICAL: SSL CRITICAL - Certificate kafka-main1006.eqiad.wmnet valid until 2025-11-20 13:35:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [13:36:51] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370719 (10Jclark-ctr) @marostegui Memory has been replaced server is back up and all yours Thank you [13:36:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370720 (10Jclark-ctr) 05Open→03Resolved [13:37:01] (03CR) 10DCausse: Add makeTargetDir function to create target directory (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [13:37:33] (03CR) 10DCausse: [C:03+1] Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [13:37:51] (03CR) 10Sbisson: [C:03+1] Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry) [13:38:42] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry) [13:39:12] (03PS1) 10Dpogorzelski: ml-services: add cassandra endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) [13:40:38] (03Merged) 
10jenkins-bot: Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry) [13:40:45] Updating recommendation-api .. [13:42:01] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11370726 (10MoritzMuehlenhoff) [13:42:09] (03CR) 10DCausse: Add makeTargetDir function to create target directory (032 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [13:42:11] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Looks great, thank you! Can't wait to test it :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [13:42:37] (03CR) 10Dpogorzelski: [C:03+2] ml-services: add cassandra endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [13:42:39] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:43:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370728 (10Marostegui) Thank you - I will reclone the host [13:44:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1241.eqiad.wmnet onto db1262.eqiad.wmnet [13:44:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003 [13:44:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370732 (10ops-monitoring-bot) Started cloning db1241.eqiad.wmnet to db1262.eqiad.wmnet - marostegui@cumin1003 [13:44:20] (03Merged) 10jenkins-bot: ml-services: add cassandra endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [13:44:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003 [13:44:41] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370734 (10ops-monitoring-bot) Completed depool of db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003 - marostegui@cumin... [13:44:49] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:50:34] (03CR) 10Brouberol: [C:03+2] airflow: release new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204864 (https://phabricator.wikimedia.org/T408711) (owner: 10Brouberol) [13:51:42] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:51:52] (03CR) 10Majavah: [C:03+1] maintain_dbusers: initialize the stats [puppet] - 10https://gerrit.wikimedia.org/r/1204820 (owner: 10David Caro) [13:55:45] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:55:57] (03CR) 10David Caro: [V:03+1 C:03+2] maintain_dbusers: initialize the stats [puppet] - 10https://gerrit.wikimedia.org/r/1204820 (owner: 10David Caro) [13:58:23] (03CR) 10DCausse: [C:03+1] "lgtm, one nit about starting to use locally scoped var in bash functions" [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [13:58:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370745 (10Jclark-ctr) Swapped lswtest on Tuesday with the failed switch in D6, cabled it, and handed it over to Cathal for setup. Today, re... [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1400). [14:00:05] mfossati and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:16] edsanders: do you want to start with your config change? [14:00:17] hello! [14:00:22] (03CR) 10DCausse: [C:03+1] "nice!" 
[dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [14:00:33] and then we can let the gate-and-submit for the backports run during that [14:00:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370748 (10cmooney) >>! In T409731#11370745, @Jclark-ctr wrote: > Swapped lswtest on Tuesday with the failed switch in D6, cabled it, and han... [14:03:24] hey, yeah [14:03:28] Lucas_WMDE: I can self-deploy! [14:03:35] ok! [14:03:50] I’d still suggest edsanders goes first, just because your CI will probably take a couple of minutes anyway :) [14:04:05] sure! [14:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders) [14:04:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370749 (10Jclark-ctr) 05Open→03Resolved a:05cmooney→03Jclark-ctr [14:05:11] (03Merged) 10jenkins-bot: Freeze LiquidThreads on enwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders) [14:05:58] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1203028|Freeze LiquidThreads on enwikinews (T406717)]] [14:06:02] T406717: Convert LQT pages on enwikinews to Flow - https://phabricator.wikimedia.org/T406717 [14:08:33] !log esanders@deploy2002 tchanders, esanders: Backport for [[gerrit:1203028|Freeze LiquidThreads on enwikinews (T406717)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:10:19] !log esanders@deploy2002 tchanders, esanders: Continuing with sync [14:10:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11370790 (10Jclark-ctr) Checked for updates. Parts will be available to ship on Thu, Nov 13, 2025. @BTullis they should arrive Friday pending no delays... [14:10:56] mfossati: are you going to deploy your backports together or separately? [14:11:37] Lucas_WMDE: together is fine [14:11:50] ok, then I’ll just +2 them to start the build [14:12:02] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati) [14:12:06] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati) [14:12:44] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org [14:12:44] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org [14:13:00] Lucas_WMDE: I'm on SpiderPig, usually it takes care of +2s. Is it aware of this? 
[14:13:19] (03Merged) 10jenkins-bot: ImageBrowsing: add tier 2 experiment [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati) [14:13:58] (03Merged) 10jenkins-bot: xLab: add tier 2 experiment to ImageBrowsing [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati) [14:14:35] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203028|Freeze LiquidThreads on enwikinews (T406717)]] (duration: 08m 37s) [14:14:39] T406717: Convert LQT pages on enwikinews to Flow - https://phabricator.wikimedia.org/T406717 [14:14:55] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11370807 (10AndrewTavis_WMDE) [14:15:37] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2203.codfw.wmnet [14:16:22] Lucas_WMDE: I'm ready to go, but please let me know if SpiderPig might conflict with your +2s [14:16:31] !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2203.codfw.wmnet [14:16:33] mfossati: you’re good to go [14:16:41] (also those builds were faster than I expected, nice ^^) [14:16:48] all right, thanks [14:17:27] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1204821|ImageBrowsing: add tier 2 experiment (T409739)]], [[gerrit:1204822|xLab: add tier 2 experiment to ImageBrowsing (T409739)]] [14:17:31] T409739: ImageBrowsing: launch the A/B test on English Wikipedia - https://phabricator.wikimedia.org/T409739 [14:19:27] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11370812 (10VRiley-WMF) 05Open→03Resolved This is completed. 
[14:19:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:19:53] !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1204821|ImageBrowsing: add tier 2 experiment (T409739)]], [[gerrit:1204822|xLab: add tier 2 experiment to ImageBrowsing (T409739)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:20:04] checking [14:20:36] 06SRE, 06Infrastructure-Foundations, 10netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11370815 (10cmooney) 05Open→03Resolved So this has bounced a few times since, however it is relatively stable.... [14:21:20] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [14:24:19] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T407510#11370825 (10cmooney) I just resolved T407578 on this one. I'll keep an eye on it though and if it gets worse we may ne... 
[14:24:38] !log homer lsw1-c6-codfw* commit 're-adding failed host -- T408004' [14:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:42] T408004: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004 [14:25:17] (03CR) 10Bking: [C:03+2] Configure reprepro to mirror upstream opensearch2 and opensearch3 repos [puppet] - 10https://gerrit.wikimedia.org/r/1204639 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking) [14:25:20] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2203.codfw.wmnet [14:25:22] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2203.codfw.wmnet [14:25:50] please hold on, I'm exhaustively checking all wikis where the experiment is deployed :-) [14:25:52] (03CR) 10Tiziano Fogli: [C:03+1] pontoon: introduce puppet::hosts function [puppet] - 10https://gerrit.wikimedia.org/r/1204360 (https://phabricator.wikimedia.org/T409905) (owner: 10Filippo Giunchedi) [14:26:00] (03CR) 10Tiziano Fogli: [C:03+1] pontoon: inject netbox metadata for stack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1204361 (https://phabricator.wikimedia.org/T409905) (owner: 10Filippo Giunchedi) [14:26:12] (03CR) 10Itamar Givon: Add makeTargetDir function to create target directory (033 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [14:26:18] (03PS1) 10Bartosz Wójtowicz: kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) [14:27:43] mfossati: ALL_the_things.png [14:28:32] (03CR) 10Dpogorzelski: [C:03+1] kserve-inference: Support loading secrets into environment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:29:13] it works (TM) [14:29:19] !log mfossati@deploy2002 mfossati: Continuing with sync [14:29:49] (03CR) 10Tiziano Fogli: [C:03+1] pontoon: clean puppet certs on host destroy [puppet] - 10https://gerrit.wikimedia.org/r/1204370 (https://phabricator.wikimedia.org/T409912) (owner: 10Filippo Giunchedi) [14:30:05] (03PS1) 10DCausse: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) [14:30:18] (03CR) 10CI reject: [V:04-1] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:30:57] (03CR) 10CI reject: [V:04-1] cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: 10DCausse) [14:33:38] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204821|ImageBrowsing: add tier 2 experiment (T409739)]], [[gerrit:1204822|xLab: add tier 2 experiment to ImageBrowsing (T409739)]] (duration: 16m 11s) [14:33:42] T409739: ImageBrowsing: launch the A/B test on English Wikipedia - https://phabricator.wikimedia.org/T409739 [14:33:57] Lucas_WMDE: all done here [14:34:04] \o/ [14:34:09] !log UTC afternoon backport+config window done [14:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:15] very relaxed window for me ;) thanks everyone! [14:34:39] thank you mate :-) [14:35:17] (03PS2) 10Bartosz Wójtowicz: kserve-inference: Support loading secrets into environment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) [14:35:46] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:39:47] (03CR) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [14:41:07] (03CR) 10Elukey: [C:03+1] Add missing Hiera entries for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1204842 (owner: 10Muehlenhoff) [14:41:26] (03CR) 10Elukey: [C:03+1] Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [14:41:33] !log Ran `foreachwikiindblist checkuser-suggested-investigations.dblist extensions/CheckUser/maintenance/populateSicUrlIdentifier.php` [14:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:05] !log Ran `foreachwikiindblist checkuser-suggested-investigations.dblist extensions/CheckUser/maintenance/populateSicUrlIdentifier.php` for T409564 [14:42:06] (03PS3) 10Bartosz Wójtowicz: kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) [14:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:09] T409564: Suggested investigations: Populate sic_url_identifier for existing cusi_case rows - https://phabricator.wikimedia.org/T409564 [14:42:28] (03PS1) 10Tiziano Fogli: check_icinga: add flags to suppress notifications/pages [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) [14:42:28] (03CR) 10Tiziano Fogli: "I’m sorry, I forgot to run the formatter in a separate commit. 
I’ve marked the actual changes with a “real change” comment here in Gerrit " [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [14:44:27] (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [14:46:24] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:46:56] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410041 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:47:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041 (10ops-monitoring-bot) 03NEW [14:47:35] (03CR) 10Elukey: [C:03+1] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:47:55] (03CR) 10Bartosz Wójtowicz: [C:03+2] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:47:56] (03CR) 10Elukey: [C:04-1] "sorrryyy chart bump!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:49:45] (03CR) 10Muehlenhoff: [C:03+2] Add missing Hiera entries for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1204842 (owner: 10Muehlenhoff) [14:50:09] !log Update Recommendation API to 2025-11-10-154629-production (T403730) [14:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:13] T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730 [14:50:23] (03PS4) 10Bking: opensearch-cluster: raise defaults to match design doc, disable upstream monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) [14:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:51:30] (03CR) 10Muehlenhoff: [C:03+2] Enable nftables on cluster::management on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [14:51:54] (03CR) 10Andrea Denisse: [C:03+1] "Overall LGTM, I just left a non blocking comment." [puppet] - 10https://gerrit.wikimedia.org/r/1204804 (owner: 10Effie Mouzeli) [14:51:54] (03Merged) 10jenkins-bot: kserve-inference: Support loading secrets into environment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:51:57] (03PS5) 10Bking: opensearch-cluster: raise defaults to match design doc, disable upstream monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) [14:52:11] (03PS4) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) [14:52:59] (03PS1) 10Bartosz Wójtowicz: kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) [14:53:45] (03CR) 10Dpogorzelski: [C:03+1] kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:54:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:56:03] 06SRE, 07SRE-Unowned, 10Maps: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528#11370924 (10elukey) Created a new bucket with `swift post` and the Tegola AUTH credentials on thanos-fe1004: ` root@thanos-fe1004:~# swift stat tegola-swift-staging-codfw-v001 Account: AUT... [14:56:23] (03CR) 10Elukey: [C:03+1] kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:57:19] (03CR) 10Bartosz Wójtowicz: [C:03+2] kserve-inference: Bump kserve-inference chart version to 0.4.17. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [14:57:55] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [14:59:34] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:00:20] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [15:01:25] (03Merged) 10jenkins-bot: kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz) [15:02:04] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [15:02:48] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPv6 address from db-test1001.eqiad.wmnet - fceratto@cumin1003" [15:02:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:03:50] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [15:05:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:05:15] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [15:05:20] !log kamila@cumin1003 START - Cookbook sre.dns.netbox [15:05:52] fceratto@cumin1003 netbox (PID 3325688) is awaiting input [15:08:00] !log kamila@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:09:38] (03CR) 10Brouberol: [C:03+1] "Technically, this sets the _default_ resource requests/limits. 
If you really wanted to define minimum resources, you'd need to define lim" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) (owner: 10Bking) [15:09:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:09:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [15:09:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:11:23] there's a change for hosts/wikikube-worker2203.yaml pending commit by cookbooks.sre.dns.netbox [15:11:38] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPv6 address from db-test1001.eqiad.wmnet - fceratto@cumin1003" [15:11:39] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:23:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:23:49] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1003.wikimedia.org with reason: Update [15:24:37] !log aokoth@cumin1003 DONE (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab2002.wikimedia.org with reason: Update [15:26:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:27:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047 (10cmooney) 03NEW p:05Triage→03Medium [15:28:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1530) [15:30:57] (03PS1) 10Brouberol: pg-airlfow-main: upscale the CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204899 [15:32:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371150 (10cmooney) [15:32:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371154 (10cmooney) [15:33:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:33:19] (03CR) 10Btullis: [C:03+1] pg-airlfow-main: upscale the CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204899 (owner: 10Brouberol) [15:33:27] (03CR) 10Brouberol: [C:03+2] pg-airlfow-main: upscale the CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204899 (owner: 10Brouberol) [15:33:58] !log bking@apt1002 sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia /home/bking/wmf-opensearch-search-plugins-1.3.20+12-bullseye/wmf-opensearch-search-plugins_1.3.20+12_amd64.changes T407520 [15:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:02] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [15:34:10] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371173 (10cmooney) [15:34:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [15:34:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [15:34:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:36:30] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [15:36:57] (03PS1) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) [15:37:29] (03CR) 10CI reject: [V:04-1] Properly rename tilerator_pass variable [puppet] - 
10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:37:53] (03PS2) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) [15:38:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:39:25] (03PS3) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) [15:40:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371233 (10Reedy) [15:41:37] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [15:50:36] (03PS2) 10Itamar Givon: Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) [15:52:39] (03PS2) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) [15:53:52] Is someone looking at wikifeeds? [15:54:01] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11371274 (10VRiley-WMF) @BTullis we have received the drive for this unit. Is there a time for us to replace this? 
[15:56:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:57:33] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520 [15:57:37] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [15:58:15] nemo-yiannis: does wikifeeds hit recommendation API in the backend? [15:58:32] RESOLVED: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:58:44] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply wmf-opensearch-search-plugins update, other updates (see also T407110) - bking@cumin2002 - T407520 [15:59:01] (03PS3) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) [15:59:20] !log eqiad c/d migrations window start [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:55] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ml-serve1003.eqiad.wmnet with reason: C/D Migration [16:00:05] andre and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1600). 
[16:00:38] (03PS1) 10Muehlenhoff: Remove a lot of historical stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1204913 (https://phabricator.wikimedia.org/T381565) [16:01:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:01:37] RESOLVED: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:01:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11371319 (10Papaul) After swapping both PEM 2 and 3 ` re0.cr1-codfw> show chassis environment pem PEM 0 status: State... [16:01:47] !log roll restarting mobileapps in codfw [16:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:53] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync [16:01:58] (03PS2) 10Itamar Givon: Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) [16:02:08] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync [16:02:10] (03PS1) 10Muehlenhoff: Remove the new unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) [16:02:12] (03PS1) 10Muehlenhoff: Add stub secrets for the staging role [labs/private] - 10https://gerrit.wikimedia.org/r/1204915 (https://phabricator.wikimedia.org/T409528) [16:02:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: 
cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1046, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0 [16:02:20] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.36898395721926 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:02:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1046, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0 [16:02:20] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.36898395721926 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:02:21] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 276 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1376, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 276, delayed_unassigned_shard [16:02:21] mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.29297820823246 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:02:21] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: 
cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1046, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0 [16:03:06] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync [16:03:56] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1027.eqiad.wmnet with reason: C/D Migration [16:04:26] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync [16:04:37] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520 [16:04:41] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [16:04:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:05:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1500, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 151, delayed_unassigned_shards: 0, number_of_pendin [16:05:20] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 90.79903147699758 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:05:20] RECOVERY - 
OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1741, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 125, delayed_unassigned_shards: 0, number_of_pending_t [16:05:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.10160427807487 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:05:20] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1741, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 125, delayed_unassigned_shards: 0, number_of_pending_t [16:05:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.10160427807487 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:06:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1034.eqiad.wmnet with reason: C/D Migration [16:06:19] ^^ those opensearch alerts are expected, I thought our cook-book would set an alert suppression. 
Will have to look at that later [16:06:33] (03PS1) 10Muehlenhoff: Remove obsolete grants file [puppet] - 10https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565) [16:06:53] marostegui@cumin1003 clone (PID 3241961) is awaiting input [16:07:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:08:30] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:52] (03PS2) 10Muehlenhoff: Add stub secrets for the staging role [labs/private] - 10https://gerrit.wikimedia.org/r/1204915 (https://phabricator.wikimedia.org/T409528) [16:08:57] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [16:09:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [16:09:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[16:09:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:10:10] (03PS2) 10DCausse: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) [16:10:44] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [16:11:09] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add stub secrets for the staging role [labs/private] - 10https://gerrit.wikimedia.org/r/1204915 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [16:11:11] (03PS3) 10DCausse: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) [16:11:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1041, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0 [16:11:20] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:11:20] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard [16:11:20] mber_of_pending_tasks: 0, 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:11:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1041, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0 [16:11:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1028.eqiad.wmnet with reason: C/D Migration [16:12:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:12:36] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528#11371360 (10elukey) After a chat with Moritz we realized that the better path is probably to create another account for staging, and create the new container in there. In this way we fully... 
[16:13:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:20] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1634, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 228, delayed_unassigned_shards: 0, number_of_pending_t [16:14:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.37967914438502 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:14:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1426, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 224, delayed_unassigned_shards: 0, number_of_pendin [16:14:20] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 39, active_shards_percent_as_number: 86.31961259079904 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:14:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1426, relocating_shards: 0, initializing_shards: 2, 
unassigned_shards: 224, delayed_unassigned_shards: 0, number_of_pendin [16:15:26] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [16:16:35] !log eqiad c/d migration project: ganeti hosts moving today with proper full drains [16:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:54] !log installing cups security updates [16:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:59] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3001.wikimedia.org [16:19:00] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:19:36] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org [16:19:37] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet [16:20:01] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1047.eqiad.wmnet with reason: C/D Migration [16:20:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1045, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0 [16:20:20] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:20:21] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1045, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0 [16:20:21] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:20:21] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 269 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1348, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 269, delayed_unassigned_shards: [16:20:37] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [16:21:57] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet [16:22:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1466, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 185, delayed_unassigned_shards: 0, number_of_pendin [16:22:20] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 88.7409200968523 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:22:20] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1444, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 171, delayed_unassigned_shards: 0, number_of_pending_ta [16:22:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 89.30117501546073 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:22:20] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1686, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 178, delayed_unassigned_shards: 0, number_of_pending_t [16:22:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 150, active_shards_percent_as_number: 90.16042780748663 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:22:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1048.eqiad.wmnet with reason: C/D Migration [16:22:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:23:00] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003" [16:23:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003" [16:23:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:05] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3001.wikimedia.org on all recursors [16:23:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3001.wikimedia.org on all recursors [16:23:23] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:23:25] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:23:33] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [16:23:38] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003" [16:23:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003" [16:23:54] !incidents [16:23:54] 6998 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [16:24:17] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3001.wikimedia.org with OS trixie [16:24:28] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), 
total of 14 - https://phabricator.wikimedia.org/T409860#11371439 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [16:24:55] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:09] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [16:26:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1037.eqiad.wmnet with reason: C/D Migration [16:26:48] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [16:26:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [16:26:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:26:53] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors [16:26:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors [16:27:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11371462 (10RobH) ganeti1028 ganeti1047 ganeti1048 ganeti1037 All migrated to new switch port after having the drain command run successfully aga... 
[16:27:28] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [16:27:32] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [16:27:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:28:15] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7001.wikimedia.org with OS trixie [16:28:19] (03CR) 10Tjones: [C:03+1] "LGTM. (I don't have +2 in this repo.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: 10DCausse) [16:28:20] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard [16:28:20] mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:28:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, 
discovered_master: True, active_primary_shards: 1043, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0 [16:28:20] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:28:22] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard [16:28:22] mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:28:22] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards: [16:28:22] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.30241187384044 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:28:22] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, 
number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards: [16:28:25] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [16:29:30] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet [16:29:55] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:31:20] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1636, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 226, delayed_unassigned_shards: 0, number_of_pending_t [16:31:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.48663101604278 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1411, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 240, 
delayed_unassigned_shards: 0, number_of_pendin [16:31:20] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 44, active_shards_percent_as_number: 85.41162227602905 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1411, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 239, delayed_unassigned_shards: 0, number_of_pendin [16:31:20] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 42, active_shards_percent_as_number: 85.41162227602905 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t [16:31:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t [16:31:22] 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:23] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1415, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 237, delayed_unassigned_shards: 0, number_of_pendin [16:31:23] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.65375302663438 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:24] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t [16:31:24] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t [16:31:25] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:26] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1415, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 237, delayed_unassigned_shards: 0, number_of_pendin [16:31:26] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.65375302663438 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:27] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1415, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 237, delayed_unassigned_shards: 0, number_of_pendin [16:31:27] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.65375302663438 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:28] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1377, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 238, delayed_unassigned_shards: 0, number_of_pending_ta [16:31:28] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.15769944341372 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:32:20] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1538, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas [16:32:20] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.11440940012369 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:32:20] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1538, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas [16:32:20] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.11440940012369 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:32:22] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1539, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 76, delayed_unassigned_shards: 0, number_of_pending_tas [16:32:22] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.17625231910947 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:32:22] 
RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1540, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 76, delayed_unassigned_shards: 0, number_of_pending_tas [16:32:22] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.23809523809523 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:32:57] (03PS1) 10Muehlenhoff: Add missing secret [labs/private] - 10https://gerrit.wikimedia.org/r/1204926 (https://phabricator.wikimedia.org/T409528) [16:34:25] (03PS3) 10Effie Mouzeli: prometheus: add temp recording rules for phpfpm_workers:active_percent [puppet] - 10https://gerrit.wikimedia.org/r/1204804 [16:34:35] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet [16:35:03] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1045.eqiad.wmnet with reason: C/D Migration [16:35:19] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:36:05] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet [16:36:10] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:07] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on 
cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:38:21] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard [16:38:21] mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:38:21] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards: [16:38:21] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.30241187384044 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:38:21] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1043, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0 [16:39:03] inflatador: ^ known? 
[16:39:28] (03PS1) 10Dreamy Jazz: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) [16:39:46] sukhe Y, I mentioned it above but I guess it got lost in the shuffle. Our cook-book might not be setting suppressions properly [16:39:52] I'll set one now [16:40:11] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet [16:40:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t [16:40:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:40:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t [16:40:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:40:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: 
cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t [16:40:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:40:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t [16:40:23] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:40:23] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t [16:40:24] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:40:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1046.eqiad.wmnet with reason: C/D Migration [16:41:08] inflatador: no worries and thanks! 
[16:41:15] the only reason I was asking is because it was causing icinga-wm to quit [16:41:20] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet [16:41:22] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin [16:41:22] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:22] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin [16:41:22] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:22] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas [16:41:22] umber_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:22] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas [16:41:23] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:23] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin [16:41:24] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:24] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas [16:41:25] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 21, active_shards_percent_as_number: 
95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:25] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin [16:41:26] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:26] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin [16:41:27] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:27] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas [16:41:28] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 29, active_shards_percent_as_number: 95.05256648113792 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:28] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1548, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 69, delayed_unassigned_shards: 0, number_of_pending_tas [16:41:29] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.73283858998145 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:41:44] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: T407520 [16:41:48] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [16:42:10] (03CR) 10Effie Mouzeli: prometheus: add temp recording rules for phpfpm_workers:active_percent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1204804 (owner: 10Effie Mouzeli) [16:42:12] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add missing secret [labs/private] - 10https://gerrit.wikimedia.org/r/1204926 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [16:42:36] jouncebot: nowandnext [16:42:36] For the next 0 hour(s) and 17 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1600) [16:42:36] In 0 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700) [16:42:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:43:12] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit 
push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:43:36] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet [16:44:01] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1049.eqiad.wmnet with reason: C/D Migration [16:44:39] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet [16:44:55] 10SRE-SLO, 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11371606 (10elukey) 05Resolved→03Open Let's keep it open until the alerts are up :) [16:45:20] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:46:22] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 276 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1376, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 276, delayed_unassigned_shard [16:46:22] mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.29297820823246 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:46:22] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1045, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, 
unassigned_shards: 311, delayed_unassigned_shards: 0 [16:46:22] _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.36898395721926 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:46:23] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 807, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards: 0, [16:46:23] of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.30241187384044 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:46:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:47:07] (03CR) 10Harroyo-wmf: [C:03+1] MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: 10Dreamy Jazz) [16:47:08] !log dancy@deploy2002 Installing scap version "4.226.0" for 2 host(s) [16:47:08] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:47:16] PROBLEM - Host idp1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:19] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node 
(exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet [16:47:28] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:47:46] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1050.eqiad.wmnet with reason: C/D Migration [16:48:09] (03PS2) 10Dreamy Jazz: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) [16:48:15] FIRING: ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:22] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1450, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 202, delayed_unassigned_shards: 0, number_of_pendin [16:48:22] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.77239709443099 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:48:22] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1432, 
relocating_shards: 0, initializing_shards: 0, unassigned_shards: 185, delayed_unassigned_shards: 0, number_of_pending_ta [16:48:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.55905998763141 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:48:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1683, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 179, delayed_unassigned_shards: 0, number_of_pending_t [16:48:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 194, active_shards_percent_as_number: 90.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:48:54] !log dancy@deploy2002 Installation of scap version "4.226.0" completed for 2 hosts [16:49:15] (03PS3) 10Dreamy Jazz: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) [16:49:21] (03CR) 10EarlyWarningBot: "[Failed command](https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php81/52088/consoleFull): `composer --ansi test`" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: 10Dreamy Jazz) [16:49:45] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [16:51:02] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [16:51:17] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for 
ElasticSearch cluster cloudelastic: apply wmf-opensearch-search-plugins update, other updates (see also T407110) - bking@cumin2002 - T407520 [16:51:21] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [16:51:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1038.eqiad.wmnet with reason: C/D Migration [16:51:53] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet [16:52:50] PROBLEM - Host urldownloader1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:53:16] FIRING: [2x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:23] FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:53:30] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11371631 (10elukey) Really nice! I'll be afk next week for holidays, but @RLazarus may be... [16:53:33] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage [16:53:56] (03CR) 10Scott French: [C:03+1] "Neat idea!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert) [16:54:46] RECOVERY - Host idp1005 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [16:55:02] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet [16:55:11] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1046.eqiad.wmnet with reason: C/D Migration [16:55:20] RECOVERY - Host urldownloader1004 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [16:56:04] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1051.eqiad.wmnet with reason: C/D Migration [16:56:18] !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet [16:56:55] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage [16:57:22] (PS5) Dreamy Jazz: VisualEditor hCaptcha: Add config to disable onload handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: Kosta Harlan) [16:57:27] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:58:12] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11371655 (Raine) Thanks @Jhancock.wm , looks good!
[16:58:15] RESOLVED: [2x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:17] jouncebot: nowandnext [16:58:17] For the next 0 hour(s) and 1 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1600) [16:58:17] In 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700) [16:58:23] RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:58:32] !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet [16:58:56] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1052.eqiad.wmnet with reason: C/D Migration [16:58:56] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: Dreamy Jazz) [16:58:56] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: Kosta Harlan) [16:59:24] (CR) Effie Mouzeli: [C:+2] prometheus: add temp recording rules for phpfpm_workers:active_percent [puppet] - https://gerrit.wikimedia.org/r/1204804 (owner: Effie Mouzeli) [17:00:04] jhathaway and moritzm: I, the Bot under the Fountain, call upon
thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:38] (PS2) Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) [17:01:08] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11371693 (RobH) All ganeti hosts migrated to their new switch ports in eqiad rows c/d [17:01:28] Dreamy_Jazz: Lemme know when you're done. I have a scap update to deploy. [17:01:49] Sure, it should just be these patches I have to deploy [17:01:58] But they are not merged yet so it could be a little bit [17:02:04] No prob. [17:02:19] !log restarting Tomcat on idp1005 [17:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:24] If you want me to stop scap while they are still merging? [17:02:30] The ETA on them being merged is 10 mins [17:02:58] Not sure how long scap updates are usually though so happy either way [17:03:12] It takes about 2 minutes to update scap, so that would work for me. [17:03:13] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:03:45] (CR) Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno) [17:03:45] I interrupted https://spiderpig.wikimedia.org/jobs/915, so floor is yours [17:03:49] Thanks!
[17:03:57] !log dancy@deploy2002 Installing scap version "4.227.0" for 2 host(s) [17:04:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on restbase1040.eqiad.wmnet with reason: C/D Migration [17:05:44] !log dancy@deploy2002 Installation of scap version "4.227.0" completed for 2 hosts [17:06:01] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on restbase1041.eqiad.wmnet with reason: C/D Migration [17:06:03] Dreamy_Jazz: Back to you. [17:06:07] Thanks! [17:06:12] (PS1) Bvibber: StickyHeaders: scroll-margin-top fixes [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) [17:06:54] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: Dreamy Jazz) [17:06:55] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: Kosta Harlan) [17:06:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: Bvibber) [17:07:50] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on aqs1019.eqiad.wmnet with reason: C/D Migration [17:08:34] (Merged) jenkins-bot: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: Dreamy
Jazz) [17:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:11:43] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11371762 (RobH) Day 5 Update: * Moved all remaining ganeti hosts today * 17 hosts moved today, 108osts remain. * All remaining hosts are either k8 hosts (i... [17:11:45] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [17:11:58] !log eqiad c/d migrations complete for today [17:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3001.wikimedia.org with OS trixie [17:12:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy3001.wikimedia.org [17:12:34] SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371766 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[17:12:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:38] (03CR) 10Bvibber: Reduce number of bucketsizes for MediaViewer (group0) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [17:14:03] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org [17:14:05] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:15:44] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 8.459 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [17:16:25] (03Merged) 10jenkins-bot: VisualEditor hCaptcha: Add config to disable onload handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: 10Kosta Harlan) [17:16:48] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1204927|MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling (T405595)]], [[gerrit:1204855|VisualEditor hCaptcha: Add config to disable onload handling (T409962)]] [17:16:54] T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [17:16:54] T409962: hCaptcha VisualEditor: Don't render or load hCaptcha if hCaptcha is not yet enabled for that mode - https://phabricator.wikimedia.org/T409962 [17:17:18] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [17:17:36] 
!log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [17:17:36] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:17:37] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors [17:17:38] jouncebot: nowandnext [17:17:39] For the next 0 hour(s) and 42 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700) [17:17:39] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late - extended edition) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1730) [17:17:40] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors [17:18:00] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host hcaptcha-proxy7001.wikimedia.org with OS trixie [17:18:00] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host hcaptcha-proxy7001.wikimedia.org [17:18:10] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [17:18:12] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... 
[17:18:15] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [17:18:27] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie [17:18:49] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [17:18:50] !log dreamyjazz@deploy2002 dreamyjazz, kharlan: Backport for [[gerrit:1204927|MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling (T405595)]], [[gerrit:1204855|VisualEditor hCaptcha: Add config to disable onload handling (T409962)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:18:54] (03CR) 10Dzahn: "arr.. so actually it's the other way around. this change would flip over which backend gets the traffic.. but you reminded me there IS a s" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [17:19:24] (03CR) 10Kamila Součková: [C:03+1] proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 (owner: 10Effie Mouzeli) [17:19:34] FYI, the MediaWiki infrastructure window is starting 30 minutes earlier than usual today to accommodate some complex changes. cc Dreamy_Jazz [17:19:53] Okay. Should be done after this [17:20:02] awesome, thanks! 
[17:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:21:20] !log dreamyjazz@deploy2002 dreamyjazz, kharlan: Continuing with sync [17:22:02] Yeah, testing complete so will have no need to do more backports after this [17:23:00] (03PS1) 10Dzahn: releases: flip the active backend from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) [17:23:33] (03CR) 10Dzahn: "that other change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204933" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [17:23:47] (03CR) 10Dzahn: "will go closely with https://gerrit.wikimedia.org/r/c/operations/dns/+/1204684" [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [17:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:25:36] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204927|MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling (T405595)]], [[gerrit:1204855|VisualEditor hCaptcha: Add config to disable onload handling (T409962)]] (duration: 08m 48s) [17:25:41] T405595: hCaptcha: 
Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [17:25:42] T409962: hCaptcha VisualEditor: Don't render or load hCaptcha if hCaptcha is not yet enabled for that mode - https://phabricator.wikimedia.org/T409962 [17:25:52] swfrench-wmf: Over to you when ready [17:25:59] Dreamy_Jazz: thanks! [17:28:24] (03CR) 10Clément Goubert: "This basically would make every call to mw-api-ext do a cross-dc DB call. I wonder how much latency this would add, but it may be acceptab" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [17:30:05] jhathaway and moritzm: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700). [17:30:05] No Gerrit patches in the queue for this window AFAICS. [17:30:05] swfrench: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late - extended edition). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1730). [17:30:14] o/ [17:31:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:32:53] just working on some final tests and will get started shortly [17:36:05] (03CR) 10Scott French: "Thanks for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:36:06] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:38:06] (03Merged) 10jenkins-bot: mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:39:35] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:39:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:40:10] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:41:18] !log scaled mw-api-ext/main to normal multi-DC size - T405955 [17:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:21] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:42:58] (03CR) 10Scott French: [C:03+2] rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:44:50] (03Merged) 10jenkins-bot: rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:52:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:52:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:52:40] (03PS3) 10Bearloga: EventStreamConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 
(https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [17:58:03] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org [17:58:04] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:58:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:58:52] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [18:00:05] swfrench: gettimeofday() says it's time for MediaWiki infrastructure (UTC late - extended edition). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1730) [18:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1800). [18:00:46] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.622 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [18:01:28] nothing for my window this week. [18:01:36] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch! Not sure how I missed this in the review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede) [18:02:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [18:02:25] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:02:28] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7001.wikimedia.org [18:02:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [18:02:42] !log stopped diverting PHP_ENGINE-enrolled traffic at rest-gateway - T405955 [18:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:46] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:02:59] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:05:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:06:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:06:46] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing hcaptcha-proxy7001;failed makevm - sukhe@cumin1003" [18:06:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:07:14] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:07:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing hcaptcha-proxy7001;failed makevm - sukhe@cumin1003" [18:07:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:22] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520 [18:07:28] T407520: Deploy various 
plugins to fix various things - https://phabricator.wikimedia.org/T407520 [18:07:29] !log scaled mw-web/main to normal multi-DC size - T405955 [18:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:16] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org [18:08:17] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:09:49] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy3002.wikimedia.org with OS trixie [18:09:50] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org [18:09:58] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [18:11:28] FYI, I am taking the scap lock to prevent deployments, which should not happen in our current capacity configuration [18:11:33] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [18:11:41] !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during mw-(api-ext|web) capacity changes - T405955 [18:11:45] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:11:48] (nice job on using that! 
we sometimes forget) [18:11:48] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [18:11:48] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:11:48] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors [18:11:49] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T407520 [18:11:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors [18:11:57] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:13:31] !log disable-puppet on A:cp hosts for ATS Lua config change - T405955 [18:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:22] (03CR) 10Scott French: [C:03+2] trafficserver: disable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1203569 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:14:43] (03CR) 10Bvibber: Reduce number of bucketsizes for MediaViewer (group0) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [18:15:03] 10ops-eqiad, 06DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072 (10RobH) 03NEW p:05Triage→03Medium [18:15:25] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003" [18:15:41] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7001.wikimedia.org - 
sukhe@cumin1003" [18:15:41] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:15:41] (03CR) 10Eric Gardner: [C:03+1] StickyHeaders: scroll-margin-top fixes [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: 10Bvibber) [18:15:41] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors [18:15:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors [18:15:49] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7001.wikimedia.org [18:17:24] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073 (10RobH) 03NEW [18:18:10] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073#11372087 (10RobH) [18:18:20] (03CR) 10Ssingh: "Question though: if the host is pooled, which will be most cases, how do we run this cookbook then? 
Like in the comment above as by Valent" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:18:39] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org [18:18:40] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:21:39] 06SRE, 06Data-Platform-SRE, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11372098 (10Ahoelzl) [18:22:00] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:22:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:22:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:22:20] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors [18:22:21] !log rolling run-puppet-agent on A:cp hosts for ATS Lua config change - T405955 [18:22:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors [18:22:28] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:31] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:28:10] sukhe@cumin1003 makevm (PID 3555498) is awaiting input [18:31:25] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove 
records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:31:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [18:31:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [18:31:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:31:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:31:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:31:56] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors [18:32:00] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors [18:32:04] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org [18:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:33:04] (03CR) 10Bearloga: [C:04-1] EventStreamConfig: add stream for Growth and Editing team edit rates (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [18:40:37] (03PS2) 10Bvibber: Reduce number of bucketsizes for MediaViewer (labs, group0) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) [18:42:05] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy3002.wikimedia.org [18:42:22] !log manually running decomm cookbook on hcaptcha-proxy3002: host makevm failed, trying again T409860 [18:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:26] T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860 [18:43:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [18:44:23] (03PS1) 10Andrew Bogott: codfw1dev: roll back horizon version to 2025-06-23-141023 [puppet] - 10https://gerrit.wikimedia.org/r/1204940 [18:44:25] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): return next to "idle" size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:45:50] swfrench-wmf: Next up, 8.4? :-) [18:45:55] hehe [18:46:09] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:46:20] (03Merged) 10jenkins-bot: mw-(api-ext|web): return next to "idle" size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:46:26] (We don't even have CI voting for 8.4.) 
[18:46:27] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: roll back horizon version to 2025-06-23-141023 [puppet] - 10https://gerrit.wikimedia.org/r/1204940 (owner: 10Andrew Bogott) [18:46:32] (03PS1) 10BCornwall: wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003) [18:48:05] !log zero external traffic on mw-(api-ext|web) next releases - T405955 [18:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:09] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:49:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy3002.wikimedia.org [18:49:14] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372191 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for... 
[18:49:55] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:50:09] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:50:20] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:50:31] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:51:55] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:52:03] (03CR) 10BCornwall: "I remember seeing `forbes-bio.org`'s page, incidentally - it was talking about making your way to forbes' lists by having a professional w" [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [18:52:09] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:52:15] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:52:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:53:02] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org [18:53:04] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:53:43] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:53:57] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:54:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:54:27] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:55:50] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:56:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: 
apply [18:56:10] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:56:21] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:56:22] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:56:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:56:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:37] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors [18:56:40] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors [18:56:57] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410080 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [18:57:02] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410080 (10ops-monitoring-bot) 03NEW [18:57:11] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003" [18:57:15] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - 
sukhe@cumin1003" [18:57:26] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie [18:57:40] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372229 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [18:57:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:58:02] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:58:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:58:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:58:36] !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during mw-(api-ext|web) capacity changes - T405955 (duration: 46m 54s) [18:58:39] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:59:32] !log scaled mw-(api-ext|web)/next to "idle" size - T405955 [18:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] andre and jeena: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1900). [19:00:14] jouncebot: no! 
[19:00:21] 🤣 [19:13:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:14:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:17:19] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372271 (10VRiley-WMF) [19:17:20] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410080#11372273 (10VRiley-WMF) →14Duplicate dup:03T410041 [19:20:06] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520 [19:20:10] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [19:22:43] (03CR) 10CDobbins: "I'd assumed that it was failing due to the combination of the dry-run flag and the fact that this requires hosts to be depooled. 
If that's" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [19:25:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520 [19:25:14] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [19:27:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:54] (03CR) 10Bking: [C:03+2] "ACK, we have changed the limitranges as well (ref Ic99ed2f2acf98d2be7723253821697525a46869f ), this will apply to the defaults as you said" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) (owner: 10Bking) [19:29:08] (03PS1) 10Scott French: deployment_server: migrate mw-experimental to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1204945 (https://phabricator.wikimedia.org/T405955) [19:29:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:35:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:37:59] (03CR) 10Ssingh: "You are right, my bad. 
Confirming:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [19:40:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:41:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:47:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning [19:47:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11372349 (10ops-monitoring-bot) Start pool of db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning - marostegui@cumin1003 [19:47:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372350 (10VRiley-WMF) a:03VRiley-WMF [19:47:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372351 (10VRiley-WMF) 05Open→03Resolved This is a duplicate. 
[19:48:34] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy3002.wikimedia.org with OS trixie [19:48:34] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org [19:48:40] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372357 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [19:51:08] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372373 (10ssingh) `hcaptcha-proxy3001` worked just fine but `hcaptcha-proxy3002` does not come... [19:52:14] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy3002.wikimedia.org [19:54:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:55:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:56:15] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [19:59:35] 10ops-eqiad, 06SRE, 06DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372396 (10Jclark-ctr) https://netbox.wikimedia.org/dcim/cables/10192/ https://netbox.wikimedia.org/dcim/cables/10191/ These two are for lswtest-eqiad for those can be ignored for... 
[19:59:40] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [19:59:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [19:59:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy3002.wikimedia.org [19:59:56] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372397 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for... 
[20:00:18] 10ops-eqiad, 06SRE, 06DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372399 (10Jclark-ctr) [20:00:29] 10ops-eqiad, 06SRE, 06DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372401 (10Jclark-ctr) [20:02:12] 10ops-eqiad, 06SRE, 06DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372406 (10Jclark-ctr) a:03Jclark-ctr [20:02:56] 10ops-eqiad, 06SRE, 06DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372411 (10Jclark-ctr) 05Open→03Resolved https://netbox.wikimedia.org/dcim/cables/1169/ was from T407008 xe-3/1/5 removed from netbox [20:16:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:17:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:18:31] 10ops-eqiad, 06DC-Ops: Unresponsive management for cephosd1001.mgmt:22 - https://phabricator.wikimedia.org/T410088 (10phaultfinder) 03NEW [20:22:32] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test [20:26:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11372503 (10Jclark-ctr) [20:26:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded 
RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372506 (10Jclark-ctr) →14Duplicate dup:03T409938 [20:30:13] (03CR) 10Dzahn: "ACK! thank you. I am writing a short plan for the actual failover steps now and will include that. Actually.. looking now if I can improv" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [20:32:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning [20:32:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1241.eqiad.wmnet onto db1262.eqiad.wmnet [20:32:48] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11372525 (10ops-monitoring-bot) Completed pool of db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning - marostegui@cumin1003 [20:32:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11372526 (10ops-monitoring-bot) Finished cloning db1241.eqiad.wmnet to db1262.eqiad.wmnet - marostegui@cumin1003 [20:38:33] (03CR) 10Dzahn: "an actual plan: https://phabricator.wikimedia.org/P85324" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [20:45:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11372555 (10RobH) a:05MoritzMuehlenhoff→03LSobanski @LSobanski, The only two #infrastructure-foundations hosts left to migrate are >>! In T4... 
[20:52:35] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520 [20:52:40] T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520 [20:56:00] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11372572 (10EdErhart-WMF) Hey folks, coming from the YoW team - I am speaking with an imperfect understanding of all the concerns involved, but would 25years.wikip... [20:57:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [20:58:01] (03PS1) 10DLynch: Editcheck: flag suggestions when logging actions [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) [20:58:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) (owner: 10DLynch) [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2100). [21:00:04] bvibber and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] o/ [21:00:27] o/ [21:01:10] I can deploy mine, or it's a small instrumentation change that'd be fine to just throw in with other patches. [21:01:20] ok i'm logged into spiderpig [21:01:31] i can do em all together, that should be fine :D [21:01:36] Works for me! [21:01:37] yours look sice and clean [21:01:41] *nice [21:01:43] i can't type today :D [21:02:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: 10Bvibber) [21:02:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [21:02:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) (owner: 10DLynch) [21:02:28] Thus the beauty of spiderpig saving us from flubbing all those shell commands. 🤩 [21:03:13] (03Merged) 10jenkins-bot: Reduce number of bucketsizes for MediaViewer (labs, group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [21:03:21] hehehe exactly [21:03:42] i let someone talk me into deploying from a bar once. never again ;) [21:05:06] At my previous job someone once showed off by deploying while riding the Space Mountain rollercoaster at Disneyland... 
[21:06:09] (Merged) jenkins-bot: StickyHeaders: scroll-margin-top fixes [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: Bvibber)
[21:06:29] (PS4) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807)
[21:06:52] (CR) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[21:07:05] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[21:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:12:52] (Merged) jenkins-bot: Editcheck: flag suggestions when logging actions [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) (owner: DLynch)
[21:13:15] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1204928|StickyHeaders: scroll-margin-top fixes (T409349)]], [[gerrit:1204700|Reduce number of bucketsizes for MediaViewer (labs, group0) (T372165)]], [[gerrit:1204957|Editcheck: flag suggestions when logging actions (T407170)]]
[21:13:22] T409349: StickyHeaders: legacy parser h3-6 section links obscure content - https://phabricator.wikimedia.org/T409349
[21:13:22] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:13:23] T407170: Create Superset dashboard to see how edit suggestions are performing (overall and by type) - https://phabricator.wikimedia.org/T407170
[21:13:50] Glad CI for wmf/* worked nice and smoothly and no-one noticed we switched from PHP 8.1 to 8.3. :-)
[21:14:06] 🔥
[21:15:17] (Dropping PHP 8.1 from dev branch, and maybe REL1_45, coming Soon™.)
[21:15:26] !log bvibber@deploy2002 bvibber, kemayo: Backport for [[gerrit:1204928|StickyHeaders: scroll-margin-top fixes (T409349)]], [[gerrit:1204700|Reduce number of bucketsizes for MediaViewer (labs, group0) (T372165)]], [[gerrit:1204957|Editcheck: flag suggestions when logging actions (T407170)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:16:15] Kemayo: can yours be tested on test servers?
[21:16:17] mine look good
[21:16:33] James_F: wooooooo
[21:16:43] bvibber: Yes, and I just tested it and it seems fine.
[21:16:48] awesome
[21:16:52] !log bvibber@deploy2002 bvibber, kemayo: Continuing with sync
[21:17:42] Now I just need to work out how to convince Brooke to deploy from a bar again.
[21:17:52] step 1: buy me some beers
[21:17:59] maybe in milan ;)
[21:18:01] Step 0: Go to someone Brooke is. :-)
[21:18:06] Oooh, yes, Milan will be fun.
[21:18:16] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[21:18:23] i've been out of circulation for a while, it'll be good to get to a hackathon again :)
[21:18:23] * James_F is considering getting the train to Milan from maybe Paris, just because trains are fun.
[21:18:26] <3
[21:19:01] did i hear trains
[21:19:21] taavi: https://www.trenitalia.com/en/frecce/frecciarossa.html
[21:19:36] do the Thalys. Paris - Bruxelles - Cologne - oh nooo.. whaaat.. Wikipedia article uses "was" https://en.wikipedia.org/wiki/Thalys
[21:20:22] that's called eurostar these days, not to be confused with the old eurostar
[21:20:25] mutante: If I start from London it'd be nice to only change once. LON -> BRU -> CGN -> … is a bit much.
[21:20:39] mutante: As opposed to LON -> PAR -> MLN.
[21:20:41] "Eurostar" then
[21:20:54] James_F: unfortunately the hackathon is happening on the one weekend next year where I have a conflict and won't be able to make it :(
[21:21:10] taavi: Boooooo. Will you at least make it to Paris for Wikimania?
[21:21:20] All these plans to get things done.
[21:21:50] I hope to, but can't say for sure yet
[21:22:13] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204928|StickyHeaders: scroll-margin-top fixes (T409349)]], [[gerrit:1204700|Reduce number of bucketsizes for MediaViewer (labs, group0) (T372165)]], [[gerrit:1204957|Editcheck: flag suggestions when logging actions (T407170)]] (duration: 08m 58s)
[21:22:14] no direct flights to italy from portland, i'll have to change planes and/or trains somewhere. might come up with something clever :)
[21:22:16] you should name mediawiki releases/sprints after famous trains
[21:22:20] T409349: StickyHeaders: legacy parser h3-6 section links obscure content - https://phabricator.wikimedia.org/T409349
[21:22:20] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:22:20] T407170: Create Superset dashboard to see how edit suggestions are performing (overall and by type) - https://phabricator.wikimedia.org/T407170
[21:22:21] Kemayo: done!
[21:22:29] bvibber: Thanks!
[21:22:54] I have a config patch to sync, when you are done
[21:23:39] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[21:23:43] bvibber: Portland -> Burbank -> (robot taxi) -> LAX -> Rome
[21:24:33] kostajh: all rady, you need me to run it?
[21:24:38] well, or just via LAX.. just saying Burbank because it's so much smaller
[21:24:52] *ready
[21:26:31] bvibber: I can sync it, thanks
[21:26:34] cool
[21:27:01] mutante: lax is pretty far out of the way; great circle route from Portland to Milan says better places to change planes are ... Iceland or London :D
[21:27:25] if the iceland seasonal is there i should totally book that
[21:28:35] bvibber: oh, yea. then maybe check if you can get 23h50m layover. if it's under 24 hours it is still one ticket. but a full night and day to see something, sleep and then continue travel can be so nice
[21:29:00] (PS1) Scott French: Deploy known-client rate limits and multi-select fixes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1204963
[21:31:32] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1204965
[21:31:34] (CR) Scott French: [V:+2] "Tested locally at `3569d73b27557b50a12f73287a7a139ccae0f4ec`." [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1204963 (owner: Scott French)
[21:32:08] (CR) Scott French: [V:+2 C:+2] Deploy known-client rate limits and multi-select fixes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1204963 (owner: Scott French)
[21:32:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:33:11] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002"
[21:33:13] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002
[21:34:02] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002
[21:34:04] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002"
[21:35:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:35:56] SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11372753 (Dzahn) I think we can roughly order the options like this, from easiest / most standard to least recommended / potentially most problematic: - SOMETHI...
[21:36:51] jouncebot: nowandnext
[21:36:52] For the next 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2100)
[21:36:52] In 0 hour(s) and 23 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2200)
[21:39:37] (PS1) Kosta Harlan: hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits [mediawiki-config] - https://gerrit.wikimedia.org/r/1204966 (https://phabricator.wikimedia.org/T405586)
[21:39:42] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: sleep test
[21:40:04] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[21:43:23] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204966 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[21:44:11] (Merged) jenkins-bot: hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits [mediawiki-config] - https://gerrit.wikimedia.org/r/1204966 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[21:44:30] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204966|hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits (T405586)]]
[21:44:34] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[21:46:34] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204966|hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:49:22] !log kharlan@deploy2002 kharlan: Continuing with sync
[21:53:26] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204966|hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits (T405586)]] (duration: 08m 56s)
[21:53:31] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2200)
[22:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:02:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:07:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:12:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:28:53] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test
[22:31:35] (CR) Scott French: [C:+1] "+1 to moving ahead with this in the interim to keep the action-API migration moving forward." [deployment-charts] - https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[22:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:31:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[22:31:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:57:50] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: Jdlrobson)
[22:59:06] (Merged) jenkins-bot: Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: Jdlrobson)
[22:59:24] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1200173|Drop references to removed Advanced mobile contribution configuration (T402470)]]
[22:59:29] T402470: Remove AMC Outreach code from Mobile - https://phabricator.wikimedia.org/T402470
[23:01:52] !log catrope@deploy2002 catrope, jdlrobson: Backport for [[gerrit:1200173|Drop references to removed Advanced mobile contribution configuration (T402470)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:04:46] !log catrope@deploy2002 catrope, jdlrobson: Continuing with sync
[23:08:54] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200173|Drop references to removed Advanced mobile contribution configuration (T402470)]] (duration: 09m 30s)
[23:08:59] T402470: Remove AMC Outreach code from Mobile - https://phabricator.wikimedia.org/T402470
[23:10:25] FIRING: SystemdUnitFailed: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:25] RESOLVED: SystemdUnitFailed: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:21:51] (PS1) Jforrester: Enable embedded Wikifunctions on Wikimania wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683)
[23:22:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683) (owner: Jforrester)
[23:27:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:30:45] !log denisse@deploy2002 Started deploy [librenms/librenms@5fca3ff]: Upgrade LibreNMS to 25.10.0 - T410039
[23:31:00] !log denisse@deploy2002 Finished deploy [librenms/librenms@5fca3ff]: Upgrade LibreNMS to 25.10.0 - T410039 (duration: 00m 15s)
[23:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:39:10] (PS1) Dzahn: releases: control jenkins service by DC name, not host name [puppet] - https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127)
[23:41:08] (PS2) Dzahn: releases: control jenkins service by DC name, not host name [puppet] - https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127)
[23:42:39] (PS1) Dzahn: releases: stop/mask jenkins in eqiad, start/unmask jenkins in codfw [puppet] - https://gerrit.wikimedia.org/r/1204982 (https://phabricator.wikimedia.org/T392127)
[23:47:29] (CR) Dzahn: "> My answer was the we will also need unmask/enable the Jenkins service manually in the new primary. Conversely we will have to mask/disab" [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)