[00:06:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:07:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 3.394 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:14:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:15:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 4.482 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:18:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:20:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.257 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:23:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:25:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30035 bytes in 3.076 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:28:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:33:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 5.270 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:38:49] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204709
[00:38:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204709 (owner: TrainBranchBot)
[00:39:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:40:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 2.832 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:51:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:54:23] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:54:51] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204709 (owner: TrainBranchBot)
[00:57:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:58:29] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.177 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:00:42] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:08:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713
[01:08:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot)
[01:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:09:39] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.006e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[01:11:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:13:47] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 04s)
[01:15:23] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:33:23] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:01:39] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[02:09:28] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot)
[02:09:47] <wikibugs>	 (CR) Zabe: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot)
[02:20:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:21:13] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:39:03] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204713 (owner: TrainBranchBot)
[02:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:54:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[02:57:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.150 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:00:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:03:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.759 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:06:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:08:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.691 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:12:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:13:33] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 9.524 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:26:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:32:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:32:33] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30045 bytes in 9.464 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:33:27] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:34:07] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:27] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:35:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:36:29] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30045 bytes in 5.856 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:37:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:40:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:47:33] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.595 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:01:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:06:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:32:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:38:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.571 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:42:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:47:33] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 8.602 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:50:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:02:33] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 9.679 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:05:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:05:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:05:53] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:07:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:08:23] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:19:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 1.737 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:32:16] <wikibugs>	 (PS3) Krinkle: Fix symbolic links [extensions/WikimediaMaintenance] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203281
[05:32:22] <wikibugs>	 (Abandoned) Krinkle: Fix symbolic links [extensions/WikimediaMaintenance] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203281 (owner: Krinkle)
[05:33:23] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:33] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:42:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 2.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:42:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:50:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:57:29] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.652 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:00:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:02:04] <kart_>	 Deploying MinT/machinetranslation.
[06:02:21] <wikibugs>	 (CR) KartikMistry: [C:+2] machinetranslation: Increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: KartikMistry)
[06:04:12] <wikibugs>	 (Merged) jenkins-bot: machinetranslation: Increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: KartikMistry)
[06:06:48] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:08:34] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:11:31] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 5.801 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:14:24] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:14:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:17:22] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:18:33] <kart_>	 !log machinetranslation: Increase replicas (T386371)
[06:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:37] <stashbot>	 T386371: Request capacity increase in preparation for MinT for wiki Readers experiment - https://phabricator.wikimedia.org/T386371
[06:20:33] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 7.610 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:23:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:26:25] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11369368 (Marostegui) @Jclark-ctr did the DIMM arrive in the end? Thanks!
[06:27:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:27:53] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11369379 (Jclark-ctr) @marostegui it arrived yesterday afternoon will be replacing first thing this morning
[06:28:47] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11369380 (Marostegui) Great news, thank you!
[06:35:11] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T409818
[06:35:15] <stashbot>	 T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818
[06:36:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 1.738 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:36:52] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2161 with weight 0 T409818', diff saved to https://phabricator.wikimedia.org/P85291 and previous config saved to /var/cache/conftool/dbconfig/20251113-063651-fceratto.json
[06:38:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:43:29] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 5.535 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:45:25] <wikibugs>	 (CR) Federico Ceratto: [C:+2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1203790 (https://phabricator.wikimedia.org/T409818) (owner: Gerrit maintenance bot)
[06:47:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:47:42] <federico3>	 !log Starting s8 codfw failover from db2165 to db2161 - T409818
[06:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:46] <stashbot>	 T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818
[06:49:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s8 codfw as read-only for maintenance - T409818', diff saved to https://phabricator.wikimedia.org/P85292 and previous config saved to /var/cache/conftool/dbconfig/20251113-064929-fceratto.json
[06:49:31] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.012 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:53:35] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:53:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2161 to s8 primary and set section read-write T409818', diff saved to https://phabricator.wikimedia.org/P85293 and previous config saved to /var/cache/conftool/dbconfig/20251113-065342-fceratto.json
[06:53:46] <stashbot>	 T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818
[06:54:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:56:26] <wikibugs>	 (CR) Federico Ceratto: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1203791 (https://phabricator.wikimedia.org/T409818) (owner: Gerrit maintenance bot)
[06:57:28] <logmsgbot>	 !log fceratto@dns1004 START - running authdns-update
[06:58:25] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 1.634 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:58:30] <logmsgbot>	 !log fceratto@dns1004 END - running authdns-update
[06:59:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:59:58] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db2165 T409818', diff saved to https://phabricator.wikimedia.org/P85294 and previous config saved to /var/cache/conftool/dbconfig/20251113-065957-fceratto.json
[07:00:02] <stashbot>	 T409818: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T409818
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0700).
[07:02:03] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade
[07:02:15] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2165 - Upgrading db2165.codfw.wmnet
[07:02:22] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2165 - Upgrading db2165.codfw.wmnet
[07:05:04] <wikibugs>	 (PS1) Federico Ceratto: db2165: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1204746 (https://phabricator.wikimedia.org/T406008)
[07:05:22] <logmsgbot>	 fceratto@cumin1003 major-upgrade (PID 2865916) is awaiting input
[07:06:00] <wikibugs>	 (CR) Marostegui: [C:+1] db2165: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1204746 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[07:06:34] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:08:02] <wikibugs>	 (CR) Federico Ceratto: [C:+2] db2165: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1204746 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[07:10:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:10:53] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:18:32] <logmsgbot>	 !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=x3
[07:18:38] <logmsgbot>	 !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=x3
[07:18:59] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1022].eqiad.wmnet with reason: Cloning clouddb1022:s3
[07:19:34] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 9.473 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:19:42] <logmsgbot>	 !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=x3
[07:19:45] <wikibugs>	 (CR) Filippo Giunchedi: "> If we add a dependency on a puppetdb it means we can't have a test setup in cloud unless we build and maintain our own local puppetdb in" [puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[07:19:55] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1022].eqiad.wmnet with reason: Cloning clouddb1022:s3
[07:20:03] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2165 gradually with 4 steps - Migration of db2165.codfw.wmnet completed
[07:22:34] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:23:10] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[07:23:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:23:34] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] mw-(api-ext|web): return next to "idle" size [deployment-charts] - https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[07:23:56] <wikibugs>	 (PS1) Marostegui: check_private_data_report: Add new hosts [puppet] - https://gerrit.wikimedia.org/r/1204749 (https://phabricator.wikimedia.org/T409557)
[07:24:07] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[07:24:34] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 9.865 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:24:46] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1204749 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[07:25:43] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:27:10] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] trafficserver: disable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1203569 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[07:27:34] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:30:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:30:48] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:33:34] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - 10https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545)
[07:35:43] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:40:43] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:41:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - 10https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto)
[07:45:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:45:48] <jinxer-wm>	 FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:50:43] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:50:48] <jinxer-wm>	 FIRING: [14x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:55:43] <jinxer-wm>	 FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:43] <jinxer-wm>	 RESOLVED: [7x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:05:32] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2165 gradually with 4 steps - Migration of db2165.codfw.wmnet completed
[08:05:33] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[08:06:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1204609 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:08:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[08:10:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet
[08:14:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet
[08:15:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet
[08:20:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11369646 (10MoritzMuehlenhoff) @RobH I've drained the two next hosts: ganeti1027 and ganeti1034 can be migrated next.  When these are done and you...
[08:20:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from alertmanager access [puppet] - 10https://gerrit.wikimedia.org/r/1204620 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:25:33] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update pwstore docs to point to cumin1003 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204375 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:28:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380)
[08:29:28] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 3.097 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:34:34] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:36:24] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30040 bytes in 0.852 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:40:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove cumin1002 from mysql root list [puppet] - 10https://gerrit.wikimedia.org/r/1204799 (https://phabricator.wikimedia.org/T389380)
[08:44:18] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Remove cumin1002 from mysql root list [puppet] - 10https://gerrit.wikimedia.org/r/1204799 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:44:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump changelog for 1.0.4 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204800
[08:52:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from mysql root list [puppet] - 10https://gerrit.wikimedia.org/r/1204799 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:53:08] <wikibugs>	 (03CR) 10Matthias Mullie: Reduce number of bucketsizes for MediaViewer (group0) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[08:57:23] <wikibugs>	 (03CR) 10David Caro: [C:03+2] maintain-dbusers: add stat for last run [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro)
[08:58:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:59:20] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet
[08:59:41] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet
[09:00:05] <jouncebot>	 andre and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T0900)
[09:01:19] <wikibugs>	 (03CR) 10David Caro: [C:03+2] maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro)
[09:01:28] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet
[09:01:45] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet
[09:02:59] <wikibugs>	 (03Merged) 10jenkins-bot: maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro)
[09:04:37] <logmsgbot>	 jmm@cumin2002 netbox (PID 264227) is awaiting input
[09:05:49] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:08:48] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204803 (https://phabricator.wikimedia.org/T408272)
[09:08:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204803 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot)
[09:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:09:44] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204803 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot)
[09:13:59] <wikibugs>	 (03PS1) 10Effie Mouzeli: prometheus: add recording rules for phpfpm_workers:active_percent [puppet] - 10https://gerrit.wikimedia.org/r/1204804
[09:19:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1204639 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking)
[09:20:11] <andre>	 Help wanted: Train deployment to group2 fails with an issue in the docker-registry
[09:20:23] <andre>	 09:10:47 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error
[09:20:24] <andre>	 subprocess.CalledProcessError: Command '['sudo', '/usr/local/bin/docker-pusher', '-q', 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-11-13-090954-publish-83']' returned non-zero exit status 1.
[09:37:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11369925 (10cmooney) @papaul I'm really getting sick of Juniper on this one.  Personally I suspect the input voltage/frequency (i.e. our feed...
[09:39:41] <wikibugs>	 (03CR) 10Jaime Nuche: "hi there, I was the relenger asked about this:" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[09:42:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:46:19] <wikibugs>	 (03CR) 10Jaime Nuche: "I'm thinking now that I didn't read the commit message right. It seems this change is only about changing the location for uploads and I'm" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[09:46:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:49:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:50:48] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet
[09:50:50] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[09:54:31] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1001.eqiad.wmnet - fceratto@cumin1003"
[09:54:59] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1001.eqiad.wmnet - fceratto@cumin1003"
[09:54:59] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:54:59] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test1001.eqiad.wmnet on all recursors
[09:55:03] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1001.eqiad.wmnet on all recursors
[09:55:35] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1001.eqiad.wmnet - fceratto@cumin1003"
[09:55:39] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1001.eqiad.wmnet - fceratto@cumin1003"
[09:56:57] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test1001.eqiad.wmnet with OS trixie
[10:00:18] <wikibugs>	 (03PS1) 10Marostegui: db1264: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1204807 (https://phabricator.wikimedia.org/T407941)
[10:00:29] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump changelog for 1.0.4 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204800 (owner: 10Muehlenhoff)
[10:01:50] <wikibugs>	 (03CR) 10Marostegui: "Host green in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1204807 (https://phabricator.wikimedia.org/T407941) (owner: 10Marostegui)
[10:01:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1264: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1204807 (https://phabricator.wikimedia.org/T407941) (owner: 10Marostegui)
[10:03:49] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on all eqiad1 cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1204623 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:03:56] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.2  refs T408272
[10:04:00] <stashbot>	 T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272
[10:07:35] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1264 slowly with 10 steps - Pooling for the first time
[10:07:45] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1001.eqiad.wmnet with reason: host reimage
[10:08:47] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1264 slowly with 10 steps - Pooling for the first time
[10:08:57] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1264 slowly with 10 steps - Pooling for the first time
[10:10:15] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1204809
[10:11:21] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380)
[10:12:25] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1204809 (owner: 10Marostegui)
[10:12:49] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1001.eqiad.wmnet with reason: host reimage
[10:15:47] <wikibugs>	 (03PS1) 10David Caro: maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810
[10:16:42] <logmsgbot>	 !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3
[10:16:51] <wikibugs>	 (03PS2) 10David Caro: maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810
[10:22:04] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize clouddb1023 [puppet] - 10https://gerrit.wikimedia.org/r/1204812 (https://phabricator.wikimedia.org/T409557)
[10:23:58] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[10:25:10] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[10:25:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 as Homer git peer [puppet] - 10https://gerrit.wikimedia.org/r/1204622 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[10:26:50] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] deployment_server: migrate mediawiki-dumps-legacy to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1203578 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[10:29:00] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 64.58 ms
[10:29:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm
[10:30:12] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 64.39 ms
[10:31:52] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on eqiad1 cloudnets [puppet] - 10https://gerrit.wikimedia.org/r/1204624 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:33:35] <wikibugs>	 (03PS1) 10Mvolz: Remove deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204813 (https://phabricator.wikimedia.org/T361576)
[10:33:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204813 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[10:34:14] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1001.eqiad.wmnet with OS trixie
[10:34:14] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1001.eqiad.wmnet
[10:37:53] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on remaining eqiad1 nodes [puppet] - 10https://gerrit.wikimedia.org/r/1204625 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:38:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Enable an oauth2-proxy for growthbook frontend and api pods (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis)
[10:41:50] <wikibugs>	 (03PS1) 10Majavah: P:openstack: neutron: Set MTU setting to 9000 in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544)
[10:44:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloud_private_subnet: Cleanup feature flag for jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:44:40] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 12 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:44:51] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloud_private_subnet: Cleanup feature flag for jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:45:07] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[10:46:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[10:46:44] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[10:48:14] <wikibugs>	 (03PS6) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627
[10:49:19] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db-test1001.eqiad.wmnet with reason: Cloning
[10:49:54] <moritzm>	 !log installing libfcgi security updates
[10:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[10:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:52:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:openstack: neutron: Set MTU setting to 9000 in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[10:53:12] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: neutron: Set MTU setting to 9000 in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1204814 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[10:53:53] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7607/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah)
[10:55:25] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019 (10Vgutierrez) 03NEW
[10:55:35] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370189 (10Vgutierrez) p:05Triage→03High
[10:56:14] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic, 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370190 (10Vgutierrez)
[10:56:58] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 64.50 ms
[10:57:18] <wikibugs>	 06SRE, 06Data-Platform-SRE, 06Traffic, 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370193 (10Vgutierrez)
[10:57:39] <wikibugs>	 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020 (10jcrespo) 03NEW
[10:57:41] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06Traffic, 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11370205 (10Vgutierrez)
[10:58:03] <wikibugs>	 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11370206 (10jcrespo)
[10:58:58] <wikibugs>	 (03PS2) 10Tchanders: Freeze LiquidThreads on enwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717)
[10:59:13] <moritzm>	 !log upgrade Envoy on idm* T405808
[10:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:18] <stashbot>	 T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[10:59:22] <wikibugs>	 (03CR) 10Tchanders: "We have the go-ahead from comms" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1100)
[11:00:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 (owner: 10David Caro)
[11:01:49] <wikibugs>	 (03CR) 10David Caro: [C:03+2] maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 (owner: 10David Caro)
[11:02:59] <wikibugs>	 (03Merged) 10jenkins-bot: maintain_dbusers: fix MaintainDBUsersDown expression [alerts] - 10https://gerrit.wikimedia.org/r/1204810 (owner: 10David Caro)
[11:03:23] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11370232 (10MGerlach)
[11:03:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11370233 (10MGerlach)
[11:07:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Setup cumin1002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1204818 (https://phabricator.wikimedia.org/T389380)
[11:08:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm
[11:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:20:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Setup cumin1002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1204818 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[11:20:56] <wikibugs>	 (03PS1) 10David Caro: maintain_dbusers: initialize the stats [puppet] - 10https://gerrit.wikimedia.org/r/1204820
[11:21:27] <wikibugs>	 (03PS1) 10Marco Fossati: ImageBrowsing: add tier 2 experiment [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739)
[11:22:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati)
[11:22:46] <wikibugs>	 (03PS1) 10Marco Fossati: xLab: add tier 2 experiment to ImageBrowsing [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739)
[11:23:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati)
[11:23:23] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:25:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "This can be merged now, cumin1002 has been moved to the insetup role for eventual decom" [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui)
[11:26:00] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmf_root_client.pp: Remove cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui)
[11:26:49] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "Tested in cloudcontrol1007:" [puppet] - 10https://gerrit.wikimedia.org/r/1204820 (owner: 10David Caro)
[11:28:30] <wikibugs>	 (03PS1) 10David Caro: maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827
[11:28:30] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:29:11] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:30:27] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:30:47] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:30:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders)
[11:32:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:32:22] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:35:33] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable nftables on cluster::management on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380)
[11:36:36] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Update config for addurl trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204830 (https://phabricator.wikimedia.org/T409957)
[11:38:23] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:41:17] <wikibugs>	 SRE-SLO, Citoid, VisualEditor, Editing-team (Kanban Board), Essential-Work: Record api-user-agent in metrics; filter by MediaWikiJs - https://phabricator.wikimedia.org/T402385#11370307 (Mvolz) Open→Resolved
[11:44:19] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:45:13] <wikibugs>	 (03PS1) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832
[11:45:19] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:45:38] <wikibugs>	 (CR) FNegri: maintain_dbusers: fix MaintainDBUsersManyErrors expression (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1204827 (owner: David Caro)
[11:45:59] <wikibugs>	 (03PS2) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832
[11:46:38] <wikibugs>	 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11370320 (10Mvolz) The dashboard now has this: https://grafana.wikimedia.org/d/NJkCVermz/c...
[11:46:39] <wikibugs>	 (03PS3) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832
[11:47:16] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:48:23] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:51:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[11:53:46] <wikibugs>	 (03PS3) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565)
[11:55:09] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:59:09] <wikibugs>	 (03PS2) 10David Caro: maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827
[11:59:16] <wikibugs>	 (CR) David Caro: maintain_dbusers: fix MaintainDBUsersManyErrors expression (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1204827 (owner: David Caro)
[12:00:08] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[12:06:09] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Disable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204619 (owner: 10Clément Goubert)
[12:07:09] <moritzm>	 !log uploaded wmf-laptop 1.0.4 to apt.wikimedia.org
[12:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:10] <wikibugs>	 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11370467 (10jcrespo) 🤨 {F70174897}
[12:32:46] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204849
[12:33:21] <moritzm>	 !log installing bind security updates (client-side tools/libs only)
[12:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:47] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup)
[12:35:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Fix alias [puppet] - 10https://gerrit.wikimedia.org/r/1204844 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[12:37:36] <wikibugs>	 SRE, Data-Persistence, Data-Persistence-Backup, media-backups: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11370496 (jcrespo) Garage also doesn't support TLS/HTTPS by default; it requires a reverse proxy: https://garagehq...
[12:39:38] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: Set ratelimit key_prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850
[12:42:40] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+1] "yes, please" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 (owner: 10Clément Goubert)
[12:45:29] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Set ratelimit key_prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 (owner: 10Clément Goubert)
[12:47:26] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Set ratelimit key_prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204850 (owner: 10Clément Goubert)
[12:48:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:48:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:49:51] <wikibugs>	 (03PS2) 10Effie Mouzeli: prometheus: add temp recording rules for phpfpm_workers:active_percent [puppet] - 10https://gerrit.wikimedia.org/r/1204804
[12:50:30] <Amir1>	 !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists bawiktionary; drop database if exists chwikimedia; drop database if exists closed_zh_twwiki; drop database if exists comcomwiki; (T297297)
[12:50:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:34] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[12:50:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:50:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:50:58] <wikibugs>	 (03CR) 10David Caro: [C:03+2] maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827 (owner: 10David Caro)
[12:52:10] <wikibugs>	 (03Merged) 10jenkins-bot: maintain_dbusers: fix MaintainDBUsersManyErrors expression [alerts] - 10https://gerrit.wikimedia.org/r/1204827 (owner: 10David Caro)
[12:55:31] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Upgrade
[12:57:12] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis)
[12:57:53] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[13:00:06] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1300)
[13:01:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[13:01:05] <moritzm>	 !log installing amd64-microcode security updates
[13:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:43] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Upgrade
[13:06:16] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:07:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:24] <wikibugs>	 (03PS1) 10Kamila Součková: hcaptcha proxy: add missing ; in nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1204863
[13:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:10:41] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] hcaptcha proxy: add missing ; in nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1204863 (owner: 10Kamila Součková)
[13:11:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11370609 (10Jclark-ctr)
[13:11:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409980#11370611 (Jclark-ctr) →Duplicate dup:T409938
[13:11:12] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] hcaptcha proxy: add missing ; in nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1204863 (owner: 10Kamila Součková)
[13:12:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11370618 (10Jclark-ctr)
[13:12:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409967#11370620 (Jclark-ctr) →Duplicate dup:T409938
[13:12:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:35] <wikibugs>	 (03PS1) 10Brouberol: airflow: release new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204864 (https://phabricator.wikimedia.org/T408711)
[13:16:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[13:17:52] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: release new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204864 (https://phabricator.wikimedia.org/T408711) (owner: 10Brouberol)
[13:19:11] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[13:19:50] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Upgrade
[13:21:08] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223)
[13:23:49] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[13:24:31] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[13:25:25] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204849 (owner: 10PipelineBot)
[13:25:48] <wikibugs>	 (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204666 (owner: 10PipelineBot)
[13:26:21] <wikibugs>	 (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203434 (owner: 10PipelineBot)
[13:26:46] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] mariadb: Productionize clouddb1023 [puppet] - 10https://gerrit.wikimedia.org/r/1204812 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[13:26:54] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize clouddb1023 [puppet] - 10https://gerrit.wikimedia.org/r/1204812 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[13:27:09] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204849 (owner: 10PipelineBot)
[13:28:43] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:29:04] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Upgrade
[13:29:10] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:29:22] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:30:09] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:30:34] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[13:31:12] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[13:31:39] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:31:57] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[13:32:11] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:32:29] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:33:27] <wikibugs>	 (03PS1) 10KartikMistry: Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730)
[13:35:16] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main1006 is CRITICAL: SSL CRITICAL - Certificate kafka-main1006.eqiad.wmnet valid until 2025-11-20 13:35:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[13:36:51] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370719 (Jclark-ctr) @marostegui Memory has been replaced; the server is back up and all yours. Thank you
[13:36:55] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370720 (Jclark-ctr) Open→Resolved
[13:37:01] <wikibugs>	 (CR) DCausse: Add makeTargetDir function to create target directory (1 comment) [dumps] - https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: Itamar Givon)
[13:37:33] <wikibugs>	 (03CR) 10DCausse: [C:03+1] Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[13:37:51] <wikibugs>	 (03CR) 10Sbisson: [C:03+1] Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry)
[13:38:42] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry)
[13:39:12] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-services: add cassandra endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414)
[13:40:38] <wikibugs>	 (03Merged) 10jenkins-bot: Update Recommendation API to 2025-11-10-154629-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204868 (https://phabricator.wikimedia.org/T403730) (owner: 10KartikMistry)
[13:40:45] <kart_>	 Updating recommendation-api ..
[13:42:01] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11370726 (10MoritzMuehlenhoff)
[13:42:09] <wikibugs>	 (CR) DCausse: Add makeTargetDir function to create target directory (2 comments) [dumps] - https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: Itamar Givon)
[13:42:11] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] "Looks great, thank you! Can't wait to test it :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski)
[13:42:37] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: add cassandra endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski)
[13:42:39] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:43:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370728 (10Marostegui) Thank you - I will reclone the host
[13:44:00] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1241.eqiad.wmnet onto db1262.eqiad.wmnet
[13:44:04] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003
[13:44:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370732 (10ops-monitoring-bot) Started cloning db1241.eqiad.wmnet to db1262.eqiad.wmnet - marostegui@cumin1003
[13:44:20] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: add cassandra endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204869 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski)
[13:44:33] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003
[13:44:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11370734 (10ops-monitoring-bot) Completed depool of db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003 - marostegui@cumin...
[13:44:49] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:50:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: release new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204864 (https://phabricator.wikimedia.org/T408711) (owner: 10Brouberol)
[13:51:42] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:51:52] <wikibugs>	 (03CR) 10Majavah: [C:03+1] maintain_dbusers: initialize the stats [puppet] - 10https://gerrit.wikimedia.org/r/1204820 (owner: 10David Caro)
[13:55:45] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:55:57] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] maintain_dbusers: initialize the stats [puppet] - 10https://gerrit.wikimedia.org/r/1204820 (owner: 10David Caro)
[13:58:23] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "lgtm, one nit about starting to use locally scoped var in bash functions" [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[13:58:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370745 (10Jclark-ctr) Swapped lswtest on Tuesday with the failed switch in D6, cabled it, and handed it over to Cathal for setup.  Today, re...
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1400).
[14:00:05] <jouncebot>	 mfossati and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:10] <Lucas_WMDE>	 o/
[14:00:16] <Lucas_WMDE>	 edsanders: do you want to start with your config change?
[14:00:17] <mfossati>	 hello!
[14:00:22] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "nice!" [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[14:00:33] <Lucas_WMDE>	 and then we can let the gate-and-submit for the backports run during that
[14:00:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370748 (10cmooney) >>! In T409731#11370745, @Jclark-ctr wrote: > Swapped lswtest on Tuesday with the failed switch in D6, cabled it, and han...
[14:03:24] <edsanders>	 hey, yeah
[14:03:28] <mfossati>	 Lucas_WMDE: I can self-deploy!
[14:03:35] <Lucas_WMDE>	 ok!
[14:03:50] <Lucas_WMDE>	 I’d still suggest edsanders goes first, just because your CI will probably take a couple of minutes anyway :)
[14:04:05] <mfossati>	 sure!
[14:04:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders)
[14:04:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370749 (Jclark-ctr) Open→Resolved a:cmooney→Jclark-ctr
[14:05:11] <wikibugs>	 (03Merged) 10jenkins-bot: Freeze LiquidThreads on enwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders)
[14:05:58] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1203028|Freeze LiquidThreads on enwikinews (T406717)]]
[14:06:02] <stashbot>	 T406717: Convert LQT pages on enwikinews to Flow - https://phabricator.wikimedia.org/T406717
[14:08:33] <logmsgbot>	 !log esanders@deploy2002 tchanders, esanders: Backport for [[gerrit:1203028|Freeze LiquidThreads on enwikinews (T406717)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:10:19] <logmsgbot>	 !log esanders@deploy2002 tchanders, esanders: Continuing with sync
[14:10:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11370790 (10Jclark-ctr) Checked for updates.   Parts will be available to ship on Thu, Nov 13, 2025.  @BTullis  they should arrive Friday pending no delays...
[14:10:56] <Lucas_WMDE>	 mfossati: are you going to deploy your backports together or separately?
[14:11:37] <mfossati>	 Lucas_WMDE: together is fine
[14:11:50] <Lucas_WMDE>	 ok, then I’ll just +2 them to start the build
[14:12:02] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati)
[14:12:06] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati)
[14:12:44] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org
[14:12:44] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org
[14:13:00] <mfossati>	 Lucas_WMDE: I'm on SpiderPig, usually it takes care of +2s. Is it aware of this?
[14:13:19] <wikibugs>	 (03Merged) 10jenkins-bot: ImageBrowsing: add tier 2 experiment [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204821 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati)
[14:13:58] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: add tier 2 experiment to ImageBrowsing [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204822 (https://phabricator.wikimedia.org/T409739) (owner: 10Marco Fossati)
[14:14:35] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203028|Freeze LiquidThreads on enwikinews (T406717)]] (duration: 08m 37s)
[14:14:39] <stashbot>	 T406717: Convert LQT pages on enwikinews to Flow - https://phabricator.wikimedia.org/T406717
[14:14:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11370807 (10AndrewTavis_WMDE)
[14:15:37] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2203.codfw.wmnet
[14:16:22] <mfossati>	 Lucas_WMDE: I'm ready to go, but please let me know if SpiderPig might conflict with your +2s
[14:16:31] <logmsgbot>	 !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2203.codfw.wmnet
[14:16:33] <Lucas_WMDE>	 mfossati: you’re good to go
[14:16:41] <Lucas_WMDE>	 (also those builds were faster than I expected, nice ^^)
[14:16:48] <mfossati>	 all right, thanks
[14:17:27] <logmsgbot>	 !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1204821|ImageBrowsing: add tier 2 experiment (T409739)]], [[gerrit:1204822|xLab: add tier 2 experiment to ImageBrowsing (T409739)]]
[14:17:31] <stashbot>	 T409739: ImageBrowsing: launch the A/B test on English Wikipedia - https://phabricator.wikimedia.org/T409739
[14:19:27] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11370812 (VRiley-WMF) Open→Resolved This is completed.
[14:19:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:19:53] <logmsgbot>	 !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1204821|ImageBrowsing: add tier 2 experiment (T409739)]], [[gerrit:1204822|xLab: add tier 2 experiment to ImageBrowsing (T409739)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:20:04] <mfossati>	 checking
[14:20:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11370815 (cmooney) Open→Resolved So this has bounced a few times since, however it is relatively stable....
[14:21:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert)
[14:24:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T407510#11370825 (10cmooney) I just resolved T407578 on this one.  I'll keep an eye on it though and if it gets worse we may ne...
[14:24:38] <Raine>	 !log homer lsw1-c6-codfw* commit 're-adding failed host -- T408004'
[14:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:42] <stashbot>	 T408004: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004
[14:25:17] <wikibugs>	 (03CR) 10Bking: [C:03+2] Configure reprepro to mirror upstream opensearch2 and opensearch3 repos [puppet] - 10https://gerrit.wikimedia.org/r/1204639 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking)
[14:25:20] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2203.codfw.wmnet
[14:25:22] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2203.codfw.wmnet
[14:25:50] <mfossati>	 please hold on, I'm exhaustively checking all wikis where the experiment is deployed :-)
[14:25:52] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] pontoon: introduce puppet::hosts function [puppet] - 10https://gerrit.wikimedia.org/r/1204360 (https://phabricator.wikimedia.org/T409905) (owner: 10Filippo Giunchedi)
[14:26:00] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] pontoon: inject netbox metadata for stack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1204361 (https://phabricator.wikimedia.org/T409905) (owner: 10Filippo Giunchedi)
[14:26:12] <wikibugs>	 (03CR) 10Itamar Givon: Add makeTargetDir function to create target directory (033 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[14:26:18] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414)
[14:27:43] <Lucas_WMDE>	 mfossati: ALL_the_things.png
[14:28:32] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:29:13] <mfossati>	 it works (TM)
[14:29:19] <logmsgbot>	 !log mfossati@deploy2002 mfossati: Continuing with sync
[14:29:49] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] pontoon: clean puppet certs on host destroy [puppet] - 10https://gerrit.wikimedia.org/r/1204370 (https://phabricator.wikimedia.org/T409912) (owner: 10Filippo Giunchedi)
[14:30:05] <wikibugs>	 (03PS1) 10DCausse: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734)
[14:30:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:30:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: 10DCausse)
[14:33:38] <logmsgbot>	 !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204821|ImageBrowsing: add tier 2 experiment (T409739)]], [[gerrit:1204822|xLab: add tier 2 experiment to ImageBrowsing (T409739)]] (duration: 16m 11s)
[14:33:42] <stashbot>	 T409739: ImageBrowsing: launch the A/B test on English Wikipedia - https://phabricator.wikimedia.org/T409739
[14:33:57] <mfossati>	 Lucas_WMDE: all done here
[14:34:04] <Lucas_WMDE>	 \o/
[14:34:09] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:15] <Lucas_WMDE>	 very relaxed window for me ;) thanks everyone!
[14:34:39] <mfossati>	 thank you mate :-)
[14:35:17] <wikibugs>	 (03PS2) 10Bartosz Wójtowicz: kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414)
[14:35:46] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:39:47] <wikibugs>	 (03CR) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[14:41:07] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add missing Hiera entries for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1204842 (owner: 10Muehlenhoff)
[14:41:26] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[14:41:33] <Dreamy_Jazz>	 !log Ran `foreachwikiindblist checkuser-suggested-investigations.dblist extensions/CheckUser/maintenance/populateSicUrlIdentifier.php`
[14:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:05] <Dreamy_Jazz>	 !log Ran `foreachwikiindblist checkuser-suggested-investigations.dblist extensions/CheckUser/maintenance/populateSicUrlIdentifier.php` for T409564
[14:42:06] <wikibugs>	 (03PS3) 10Bartosz Wójtowicz: kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414)
[14:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:09] <stashbot>	 T409564: Suggested investigations: Populate sic_url_identifier for existing cusi_case rows - https://phabricator.wikimedia.org/T409564
[14:42:28] <wikibugs>	 (03PS1) 10Tiziano Fogli: check_icinga: add flags to suppress notifications/pages [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625)
[14:42:28] <wikibugs>	 (03CR) 10Tiziano Fogli: "I’m sorry, I forgot to run the formatter in a separate commit. I’ve marked the actual changes with a “real change” comment here in Gerrit " [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[14:44:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1204797 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[14:46:24] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:46:56] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410041 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[14:47:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041 (10ops-monitoring-bot) 03NEW
[14:47:35] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:47:55] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:47:56] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "sorrryyy chart bump!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:49:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add missing Hiera entries for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1204842 (owner: 10Muehlenhoff)
[14:50:09] <kart_>	 !log Update Recommendation API to 2025-11-10-154629-production (T403730)
[14:50:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:13] <stashbot>	 T403730: Treat article translation on mobile as (lead) section translation - https://phabricator.wikimedia.org/T403730
[14:50:23] <wikibugs>	 (03PS4) 10Bking: opensearch-cluster: raise defaults to match design doc, disable upstream monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501)
[14:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:51:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Enable nftables on cluster::management on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[14:51:54] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "Overall LGTM, I just left a non blocking comment." [puppet] - 10https://gerrit.wikimedia.org/r/1204804 (owner: 10Effie Mouzeli)
[14:51:54] <wikibugs>	 (03Merged) 10jenkins-bot: kserve-inference: Support loading secrets into environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204889 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:51:57] <wikibugs>	 (03PS5) 10Bking: opensearch-cluster: raise defaults to match design doc, disable upstream monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501)
[14:52:11] <wikibugs>	 (03PS4) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565)
[14:52:59] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414)
[14:53:45] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:54:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:56:03] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528#11370924 (10elukey) Created a new bucket with `swift post` and the Tegola AUTH credentials on thanos-fe1004:  ` root@thanos-fe1004:~# swift stat tegola-swift-staging-codfw-v001                       Account: AUT...
[14:56:23] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:57:19] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[14:57:55] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[14:59:34] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:00:20] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade  on VRTS host vrts1003.eqiad.wmnet
[15:01:25] <wikibugs>	 (03Merged) 10jenkins-bot: kserve-inference: Bump kserve-inference chart version to 0.4.17. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204894 (https://phabricator.wikimedia.org/T409414) (owner: 10Bartosz Wójtowicz)
[15:02:04] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0)  on VRTS host vrts1003.eqiad.wmnet
[15:02:48] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPv6 address from db-test1001.eqiad.wmnet - fceratto@cumin1003"
[15:02:55] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[15:03:50] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade  on VRTS host vrts1003.eqiad.wmnet
[15:05:12] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:05:15] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0)  on VRTS host vrts1003.eqiad.wmnet
[15:05:20] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.dns.netbox
[15:05:52] <logmsgbot>	 fceratto@cumin1003 netbox (PID 3325688) is awaiting input
[15:08:00] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:09:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Technically, this sets the _default_ resource requests/limits. If you really wanted to defined minimum resources, you'd need to define lim" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) (owner: 10Bking)
[15:09:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:09:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[15:09:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:11:23] <federico3>	 there's a change for hosts/wikikube-worker2203.yaml pending commit by cookbooks.sre.dns.netbox
[15:11:38] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPv6 address from db-test1001.eqiad.wmnet - fceratto@cumin1003"
[15:11:39] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:16:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:23:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[15:23:49] <logmsgbot>	 !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1003.wikimedia.org with reason: Update
[15:24:37] <logmsgbot>	 !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab2002.wikimedia.org with reason: Update
[15:26:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:27:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on  public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047 (10cmooney) 03NEW p:05Triage→03Medium
[15:28:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1530)
[15:30:57] <wikibugs>	 (03PS1) 10Brouberol: pg-airlfow-main: upscale the CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204899
[15:32:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on  public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371150 (10cmooney)
[15:32:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on  public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371154 (10cmooney)
[15:33:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:33:19] <wikibugs>	 (03CR) 10Btullis: [C:03+1] pg-airlfow-main: upscale the CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204899 (owner: 10Brouberol)
[15:33:27] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] pg-airlfow-main: upscale the CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204899 (owner: 10Brouberol)
[15:33:58] <inflatador>	 !log bking@apt1002 sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia /home/bking/wmf-opensearch-search-plugins-1.3.20+12-bullseye/wmf-opensearch-search-plugins_1.3.20+12_amd64.changes T407520
[15:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:02] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[15:34:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on  public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371173 (10cmooney)
[15:34:28] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[15:34:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[15:34:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:36:30] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[15:36:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565)
[15:37:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[15:37:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565)
[15:38:02] <jinxer-wm>	 FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:39:25] <wikibugs>	 (03PS3) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565)
[15:40:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371233 (10Reedy)
[15:41:37] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[15:50:36] <wikibugs>	 (03PS2) 10Itamar Givon: Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800)
[15:52:39] <wikibugs>	 (03PS2) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800)
[15:53:52] <claime>	 Is someone looking at wikifeeds?
[15:54:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11371274 (10VRiley-WMF) @BTullis we have received the drive for this unit. Is there a time for us to replace this?
[15:56:36] <jinxer-wm>	 FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm  - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:57:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520
[15:57:37] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[15:58:15] <claime>	 nemo-yiannis: does wikifeeds hit recommendation API in the backend?
[15:58:32] <jinxer-wm>	 RESOLVED: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:58:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply wmf-opensearch-search-plugins update, other updates (see also T407110) - bking@cumin2002 - T407520
[15:59:01] <wikibugs>	 (03PS3) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800)
[15:59:20] <robh>	 !log eqiad c/d migrations window start
[15:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:55] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ml-serve1003.eqiad.wmnet with reason: C/D Migration
[16:00:05] <jouncebot>	 andre and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1600).
[16:00:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove a lot of historical stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1204913 (https://phabricator.wikimedia.org/T381565)
[16:01:13] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[16:01:37] <jinxer-wm>	 RESOLVED: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm  - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[16:01:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11371319 (10Papaul) After swapping both PEM 2 and 3  ` re0.cr1-codfw> show chassis environment pem     PEM 0 status:   State...
[16:01:47] <claime>	 !log roll restarting mobileapps in codfw
[16:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync
[16:01:58] <wikibugs>	 (03PS2) 10Itamar Givon: Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800)
[16:02:08] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync
[16:02:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove the new unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565)
[16:02:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Add stub secrets for the staging role [labs/private] - 10https://gerrit.wikimedia.org/r/1204915 (https://phabricator.wikimedia.org/T409528)
[16:02:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1046, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0
[16:02:20] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.36898395721926 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:02:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1046, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0
[16:02:20] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.36898395721926 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:02:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 276 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1376, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 276, delayed_unassigned_shard
[16:02:21] <icinga-wm>	 mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.29297820823246 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:02:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1046, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0
[16:03:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync
[16:03:56] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1027.eqiad.wmnet with reason: C/D Migration
[16:04:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync
[16:04:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520
[16:04:41] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[16:04:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:05:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1500, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 151, delayed_unassigned_shards: 0, number_of_pendin
[16:05:20] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 90.79903147699758 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:05:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1741, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 125, delayed_unassigned_shards: 0, number_of_pending_t
[16:05:20] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.10160427807487 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:05:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1741, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 125, delayed_unassigned_shards: 0, number_of_pending_t
[16:05:20] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.10160427807487 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:06:14] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1034.eqiad.wmnet with reason: C/D Migration
[16:06:19] <inflatador>	 ^^ those OpenSearch alerts are expected; I thought our cookbook would set an alert suppression. Will have to look at that later.
[16:06:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete grants file [puppet] - 10https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565)
[16:06:53] <logmsgbot>	 marostegui@cumin1003 clone (PID 3241961) is awaiting input
[16:07:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:08:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:52] <wikibugs>	 (03PS2) 10Muehlenhoff: Add stub secrets for the staging role [labs/private] - 10https://gerrit.wikimedia.org/r/1204915 (https://phabricator.wikimedia.org/T409528)
[16:08:57] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[16:09:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[16:09:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[16:09:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:10:10] <wikibugs>	 (PS2) DCausse: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734)
[16:10:44] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet
[16:11:09] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Add stub secrets for the staging role [labs/private] - https://gerrit.wikimedia.org/r/1204915 (https://phabricator.wikimedia.org/T409528) (owner: Muehlenhoff)
[16:11:11] <wikibugs>	 (PS3) DCausse: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734)
[16:11:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1041, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0
[16:11:20] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:11:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard
[16:11:20] <icinga-wm>	 mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:11:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1041, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0
[16:11:30] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1028.eqiad.wmnet with reason: C/D Migration
[16:12:04] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[16:12:36] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528#11371360 (elukey) After a chat with Moritz we realized that the better path is probably to create another account for staging, and create the new container in there. In this way we fully...
[16:13:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1634, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 228, delayed_unassigned_shards: 0, number_of_pending_t
[16:14:20] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.37967914438502 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:14:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1426, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 224, delayed_unassigned_shards: 0, number_of_pendin
[16:14:20] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 39, active_shards_percent_as_number: 86.31961259079904 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:14:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1426, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 224, delayed_unassigned_shards: 0, number_of_pendin
[16:15:26] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet
[16:16:35] <robh>	 !log eqiad c/d migration project: ganeti hosts moving today with proper full drains 
[16:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:18:54] <moritzm>	 !log installing cups security updates
[16:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:59] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3001.wikimedia.org
[16:19:00] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[16:19:36] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[16:19:37] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet
[16:20:01] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1047.eqiad.wmnet with reason: C/D Migration
[16:20:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1045, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0
[16:20:20] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:20:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1045, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0
[16:20:21] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:20:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 269 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1348, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 269, delayed_unassigned_shards:
[16:20:37] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[16:21:57] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet
[16:22:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1466, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 185, delayed_unassigned_shards: 0, number_of_pendin
[16:22:20] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.7409200968523 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:22:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1444, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 171, delayed_unassigned_shards: 0, number_of_pending_ta
[16:22:20] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 89.30117501546073 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:22:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1686, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 178, delayed_unassigned_shards: 0, number_of_pending_t
[16:22:20] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 150, active_shards_percent_as_number: 90.16042780748663 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:22:22] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1048.eqiad.wmnet with reason: C/D Migration
[16:22:46] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:23:00] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003"
[16:23:05] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003"
[16:23:05] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:23:05] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3001.wikimedia.org on all recursors
[16:23:08] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3001.wikimedia.org on all recursors
[16:23:23] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[16:23:25] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:33] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet
[16:23:38] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003"
[16:23:42] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3001.wikimedia.org - sukhe@cumin1003"
[16:23:54] <jhathaway>	 !incidents
[16:23:54] <sirenbot>	 6998 (ACKED)  Primary outbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[16:24:17] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3001.wikimedia.org with OS trixie
[16:24:28] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371439 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[16:24:55] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:09] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet
[16:26:22] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1037.eqiad.wmnet with reason: C/D Migration
[16:26:48] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[16:26:52] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[16:26:52] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:26:53] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[16:26:56] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[16:27:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11371462 (RobH) ganeti1028 ganeti1047 ganeti1048 ganeti1037  All migrated to new switch port after having the drain command run successfully aga...
[16:27:28] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[16:27:32] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[16:27:46] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:28:15] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7001.wikimedia.org with OS trixie
[16:28:19] <wikibugs>	 (CR) Tjones: [C:+1] "LGTM. (I don't have +2 in this repo.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: DCausse)
[16:28:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard
[16:28:20] <icinga-wm>	 mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:28:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1043, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0
[16:28:20] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.31550802139037 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:28:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard
[16:28:22] <icinga-wm>	 mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:28:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards:
[16:28:22] <icinga-wm>	 er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.30241187384044 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:28:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards:
[16:28:25] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[16:29:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet
[16:29:55] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1636, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 226, delayed_unassigned_shards: 0, number_of_pending_t
[16:31:20] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.48663101604278 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1411, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 240, delayed_unassigned_shards: 0, number_of_pendin
[16:31:20] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 44, active_shards_percent_as_number: 85.41162227602905 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1411, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 239, delayed_unassigned_shards: 0, number_of_pendin
[16:31:20] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 42, active_shards_percent_as_number: 85.41162227602905 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t
[16:31:22] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t
[16:31:22] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1415, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 237, delayed_unassigned_shards: 0, number_of_pendin
[16:31:23] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.65375302663438 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:24] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t
[16:31:24] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1637, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 225, delayed_unassigned_shards: 0, number_of_pending_t
[16:31:25] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.54010695187165 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:26] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1415, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 237, delayed_unassigned_shards: 0, number_of_pendin
[16:31:26] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.65375302663438 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1415, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 237, delayed_unassigned_shards: 0, number_of_pendin
[16:31:27] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.65375302663438 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:31:28] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1377, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 238, delayed_unassigned_shards: 0, number_of_pending_ta
[16:31:28] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.15769944341372 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:32:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1538, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas
[16:32:20] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.11440940012369 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:32:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1538, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas
[16:32:20] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.11440940012369 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:32:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1539, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 76, delayed_unassigned_shards: 0, number_of_pending_tas
[16:32:22] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.17625231910947 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:32:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1540, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 76, delayed_unassigned_shards: 0, number_of_pending_tas
[16:32:22] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.23809523809523 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:32:57] <wikibugs>	 (PS1) Muehlenhoff: Add missing secret [labs/private] - https://gerrit.wikimedia.org/r/1204926 (https://phabricator.wikimedia.org/T409528)
[16:34:25] <wikibugs>	 (PS3) Effie Mouzeli: prometheus: add temp recording rules for phpfpm_workers:active_percent [puppet] - https://gerrit.wikimedia.org/r/1204804
[16:34:35] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet
[16:35:03] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1045.eqiad.wmnet with reason: C/D Migration
[16:35:19] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:36:05] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet
[16:36:10] <jinxer-wm>	 FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:37:07] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:38:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 275 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1377, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 275, delayed_unassigned_shard
[16:38:21] <icinga-wm>	 mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.35351089588377 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:38:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 808, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards:
[16:38:21] <icinga-wm>	 er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.30241187384044 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:38:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1043, active_shards: 1558, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 312, delayed_unassigned_shards: 0
[16:39:03] <sukhe>	 inflatador: ^ known?
[16:39:28] <wikibugs>	 (03PS1) 10Dreamy Jazz: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595)
[16:39:46] <inflatador>	 sukhe: Yes, I mentioned it above but I guess it got lost in the shuffle. Our cookbook might not be setting suppressions properly
[16:39:52] <inflatador>	 I'll set one now
[16:40:11] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet
[16:40:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t
[16:40:22] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:40:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t
[16:40:22] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:40:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t
[16:40:22] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:40:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t
[16:40:23] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:40:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1614, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 256, delayed_unassigned_shards: 0, number_of_pending_t
[16:40:24] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.31016042780749 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:40:41] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1046.eqiad.wmnet with reason: C/D Migration
[16:41:08] <sukhe>	 inflatador: no worries and thanks!
[16:41:15] <sukhe>	 the only reason I was asking is because it was causing icinga-wm to quit
[16:41:20] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet
[16:41:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin
[16:41:22] <icinga-wm>	  0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin
[16:41:22] <icinga-wm>	  0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas
[16:41:22] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas
[16:41:23] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin
[16:41:24] <icinga-wm>	  0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:24] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas
[16:41:25] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 21, active_shards_percent_as_number: 95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin
[16:41:26] <icinga-wm>	  0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:26] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1548, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 102, delayed_unassigned_shards: 0, number_of_pendin
[16:41:27] <icinga-wm>	  0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.7046004842615 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1537, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 79, delayed_unassigned_shards: 0, number_of_pending_tas
[16:41:28] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 29, active_shards_percent_as_number: 95.05256648113792 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:28] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1548, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 69, delayed_unassigned_shards: 0, number_of_pending_tas
[16:41:29] <icinga-wm>	 umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.73283858998145 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:41:44] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: T407520
[16:41:48] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[16:42:10] <wikibugs>	 (03CR) 10Effie Mouzeli: prometheus: add temp recording rules for phpfpm_workers:active_percent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1204804 (owner: 10Effie Mouzeli)
[16:42:12] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add missing secret [labs/private] - 10https://gerrit.wikimedia.org/r/1204926 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff)
[16:42:36] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:42:36] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1600)
[16:42:36] <jouncebot>	 In 0 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700)
[16:42:40] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:43:12] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:43:36] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet
[16:44:01] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1049.eqiad.wmnet with reason: C/D Migration
[16:44:39] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet
[16:44:55] <wikibugs>	 10SRE-SLO, 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11371606 (10elukey) 05Resolved→03Open Let's keep it open until the alerts are up :)
[16:45:20] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:46:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 276 threshold =0.15 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 825, active_shards: 1376, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 276, delayed_unassigned_shard
[16:46:22] <icinga-wm>	 mber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.29297820823246 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:46:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 311 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 1045, active_shards: 1559, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 311, delayed_unassigned_shards: 0
[16:46:22] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.36898395721926 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:46:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 807, active_shards: 1347, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 270, delayed_unassigned_shards: 0,
[16:46:23] <icinga-wm>	 of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.30241187384044 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:46:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:47:07] <wikibugs>	 (03CR) 10Harroyo-wmf: [C:03+1] MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: 10Dreamy Jazz)
[16:47:08] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.226.0" for 2 host(s)
[16:47:08] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:47:16] <icinga-wm>	 PROBLEM - Host idp1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:19] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet
[16:47:28] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:47:46] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1050.eqiad.wmnet with reason: C/D Migration
[16:48:09] <wikibugs>	 (03PS2) 10Dreamy Jazz: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595)
[16:48:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:48:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 825, active_shards: 1450, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 202, delayed_unassigned_shards: 0, number_of_pendin
[16:48:22] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.77239709443099 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:48:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1432, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 185, delayed_unassigned_shards: 0, number_of_pending_ta
[16:48:22] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.55905998763141 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:48:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 1099, active_shards: 1683, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 179, delayed_unassigned_shards: 0, number_of_pending_t
[16:48:22] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 194, active_shards_percent_as_number: 90.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:48:54] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.226.0" completed for 2 hosts
[16:49:15] <wikibugs>	 (03PS3) 10Dreamy Jazz: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595)
[16:49:21] <wikibugs>	 (03CR) 10EarlyWarningBot: "[Failed command](https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php81/52088/consoleFull): `composer --ansi test`" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: 10Dreamy Jazz)
[16:49:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet
[16:51:02] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet
[16:51:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply wmf-opensearch-search-plugins update, other updates (see also T407110) - bking@cumin2002 - T407520
[16:51:21] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[16:51:30] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1038.eqiad.wmnet with reason: C/D Migration
[16:51:53] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet
[16:52:50] <icinga-wm>	 PROBLEM - Host urldownloader1004 is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:53:23] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:53:30] <wikibugs>	 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11371631 (10elukey) Really nice! I'll be afk next week for holidays, but @RLazarus may be...
[16:53:33] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage
[16:53:56] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Neat idea!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert)
[16:54:46] <icinga-wm>	 RECOVERY - Host idp1005 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[16:55:02] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet
[16:55:11] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1046.eqiad.wmnet with reason: C/D Migration
[16:55:20] <icinga-wm>	 RECOVERY - Host urldownloader1004 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[16:56:04] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1051.eqiad.wmnet with reason: C/D Migration
[16:56:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet
[16:56:55] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage
[16:57:22] <wikibugs>	 (03PS5) 10Dreamy Jazz: VisualEditor hCaptcha: Add config to disable onload handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: 10Kosta Harlan)
[16:57:27] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:58:12] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11371655 (10Raine) Thanks @Jhancock.wm , looks good!
[16:58:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:58:17] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:58:17] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1600)
[16:58:17] <jouncebot>	 In 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700)
[16:58:23] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:58:32] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet
[16:58:56] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on ganeti1052.eqiad.wmnet with reason: C/D Migration
[16:58:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: 10Dreamy Jazz)
[16:58:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: 10Kosta Harlan)
[16:59:24] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] prometheus: add temp recording rules for phpfpm_workers:active_percent [puppet] - 10https://gerrit.wikimedia.org/r/1204804 (owner: 10Effie Mouzeli)
[17:00:04] <jouncebot>	 jhathaway and moritzm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:38] <wikibugs>	 (03PS2) 10Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177)
[17:01:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11371693 (10RobH) All ganeti hosts migrated to their new switch ports in eqiad rows c/d
[17:01:28] <dancy>	 Dreamy_Jazz: Lemme know when you're done. I have a scap update to deploy.
[17:01:49] <Dreamy_Jazz>	 Sure, it should just be these patches I have to deploy
[17:01:58] <Dreamy_Jazz>	 But they are not merged yet so it could be a little bit
[17:02:04] <dancy>	 No prob.
[17:02:19] <moritzm>	 !log restarting Tomcat on idp1005
[17:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:24] <Dreamy_Jazz>	 Do you want me to stop scap while they are still merging?
[17:02:30] <Dreamy_Jazz>	 The ETA on them being merged is 10 mins
[17:02:58] <Dreamy_Jazz>	 Not sure how long scap updates usually take, though, so I'm happy either way
[17:03:12] <dancy>	 It takes about 2 minutes to update scap, so that would work for me.
[17:03:13] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:03:45] <wikibugs>	 (03CR) 10Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[17:03:45] <Dreamy_Jazz>	 I interrupted https://spiderpig.wikimedia.org/jobs/915, so floor is yours
[17:03:49] <dancy>	 Thanks!
[17:03:57] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.227.0" for 2 host(s)
[17:04:14] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on restbase1040.eqiad.wmnet with reason: C/D Migration
[17:05:44] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.227.0" completed for 2 hosts
[17:06:01] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on restbase1041.eqiad.wmnet with reason: C/D Migration
[17:06:03] <dancy>	 Dreamy_Jazz: Back to you.
[17:06:07] <Dreamy_Jazz>	 Thanks!
[17:06:12] <wikibugs>	 (PS1) Bvibber: StickyHeaders: scroll-margin-top fixes [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349)
[17:06:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: Dreamy Jazz)
[17:06:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: Kosta Harlan)
[17:06:59] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: Bvibber)
[17:07:50] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on aqs1019.eqiad.wmnet with reason: C/D Migration
[17:08:34] <wikibugs>	 (Merged) jenkins-bot: MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204927 (https://phabricator.wikimedia.org/T405595) (owner: Dreamy Jazz)
[17:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:11:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11371762 (RobH) Day 5 Update:  * Moved all remaining ganeti hosts today * 17 hosts moved today, 108 hosts remain. * All remaining hosts are either k8 hosts (i...
[17:11:45] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[17:11:58] <robh>	 !log eqiad c/d migrations complete for today
[17:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:25] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3001.wikimedia.org with OS trixie
[17:12:25] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy3001.wikimedia.org
[17:12:34] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371766 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[17:12:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:38] <wikibugs>	 (CR) Bvibber: Reduce number of bucketsizes for MediaViewer (group0) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[17:14:03] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org
[17:14:05] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[17:15:44] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 8.459 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[17:16:25] <wikibugs>	 (Merged) jenkins-bot: VisualEditor hCaptcha: Add config to disable onload handling [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204855 (https://phabricator.wikimedia.org/T409962) (owner: Kosta Harlan)
[17:16:48] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1204927|MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling (T405595)]], [[gerrit:1204855|VisualEditor hCaptcha: Add config to disable onload handling (T409962)]]
[17:16:54] <stashbot>	 T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595
[17:16:54] <stashbot>	 T409962: hCaptcha VisualEditor: Don't render or load hCaptcha if hCaptcha is not yet enabled for that mode - https://phabricator.wikimedia.org/T409962
[17:17:18] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:17:36] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:17:36] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:17:37] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors
[17:17:38] <swfrench-wmf>	 jouncebot: nowandnext
[17:17:39] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700)
[17:17:39] <jouncebot>	 In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late - extended edition) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1730)
[17:17:40] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors
[17:18:00] <logmsgbot>	 !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host hcaptcha-proxy7001.wikimedia.org with OS trixie
[17:18:00] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host hcaptcha-proxy7001.wikimedia.org
[17:18:10] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:18:12] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371802 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[17:18:15] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:18:27] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[17:18:49] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11371805 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[17:18:50] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, kharlan: Backport for [[gerrit:1204927|MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling (T405595)]], [[gerrit:1204855|VisualEditor hCaptcha: Add config to disable onload handling (T409962)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:18:54] <wikibugs>	 (CR) Dzahn: "arr.. so actually it's the other way around. this change would flip over which backend gets the traffic.. but you reminded me there IS a s" [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)
[17:19:24] <wikibugs>	 (CR) Kamila Součková: [C:+1] proxoid: update alert to check the right cluster [alerts] - https://gerrit.wikimedia.org/r/1204363 (owner: Effie Mouzeli)
[17:19:34] <swfrench-wmf>	 FYI, the MediaWiki infrastructure window is starting 30 minutes earlier than usual today to accommodate some complex changes. cc Dreamy_Jazz
[17:19:53] <Dreamy_Jazz>	 Okay. Should be done after this
[17:20:02] <swfrench-wmf>	 awesome, thanks!
[17:20:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:21:20] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, kharlan: Continuing with sync
[17:22:02] <Dreamy_Jazz>	 Yeah, testing complete so will have no need to do more backports after this
[17:23:00] <wikibugs>	 (PS1) Dzahn: releases: flip the active backend from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127)
[17:23:33] <wikibugs>	 (CR) Dzahn: "that other change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204933" [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)
[17:23:47] <wikibugs>	 (CR) Dzahn: "will go closely with https://gerrit.wikimedia.org/r/c/operations/dns/+/1204684" [puppet] - https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)
[17:25:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:25:36] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204927|MakeGlobalVariablesScriptHookHandler: Fix hCaptcha site key handling (T405595)]], [[gerrit:1204855|VisualEditor hCaptcha: Add config to disable onload handling (T409962)]] (duration: 08m 48s)
[17:25:41] <stashbot>	 T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595
[17:25:42] <stashbot>	 T409962: hCaptcha VisualEditor: Don't render or load hCaptcha if hCaptcha is not yet enabled for that mode - https://phabricator.wikimedia.org/T409962
[17:25:52] <Dreamy_Jazz>	 swfrench-wmf: Over to you when ready
[17:25:59] <swfrench-wmf>	 Dreamy_Jazz: thanks!
[17:28:24] <wikibugs>	 (CR) Clément Goubert: "This basically would make every call to mw-api-ext do a cross-dc DB call. I wonder how much latency this would add, but it may be acceptab" [deployment-charts] - https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[17:30:05] <jouncebot>	 jhathaway and moritzm: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1700).
[17:30:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:30:05] <jouncebot>	 swfrench: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late - extended edition). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1730).
[17:30:14] <swfrench-wmf>	 o/
[17:31:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:32:53] <swfrench-wmf>	 just working on some final tests and will get started shortly
[17:36:05] <wikibugs>	 (CR) Scott French: "Thanks for the reviews!" [deployment-charts] - https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[17:36:06] <wikibugs>	 (CR) Scott French: [C:+2] mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[17:38:06] <wikibugs>	 (Merged) jenkins-bot: mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[17:39:35] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:39:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:40:10] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:41:18] <swfrench-wmf>	 !log scaled mw-api-ext/main to normal multi-DC size - T405955
[17:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:21] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[17:42:58] <wikibugs>	 (CR) Scott French: [C:+2] rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[17:44:50] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[17:52:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[17:52:38] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[17:52:40] <wikibugs>	 (PS3) Bearloga: EventStreamConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[17:58:03] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[17:58:04] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[17:58:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:58:52] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:00:05] <jouncebot>	 swfrench: gettimeofday() says it's time for MediaWiki infrastructure (UTC late - extended edition). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1730)
[18:00:05] <jouncebot>	 bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1800).
[18:00:46] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.622 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:01:28] <bd808>	 nothing for my window this week.
[18:01:36] <wikibugs>	 (CR) Ssingh: [C:+1] "Thanks for the patch! Not sure how I missed this in the review." [puppet] - https://gerrit.wikimedia.org/r/1202986 (owner: Slyngshede)
[18:02:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[18:02:25] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:02:28] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7001.wikimedia.org
[18:02:34] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[18:02:42] <swfrench-wmf>	 !log stopped diverting PHP_ENGINE-enrolled traffic at rest-gateway - T405955
[18:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:46] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:02:59] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:05:47] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:06:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:06:46] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing hcaptcha-proxy7001;failed makevm - sukhe@cumin1003"
[18:06:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:07:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:07:20] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing hcaptcha-proxy7001;failed makevm - sukhe@cumin1003"
[18:07:20] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:07:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520
[18:07:28] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[18:07:29] <swfrench-wmf>	 !log scaled mw-web/main to normal multi-DC size - T405955
[18:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:16] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[18:08:17] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:09:49] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[18:09:50] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org
[18:09:58] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372009 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[18:11:28] <swfrench-wmf>	 FYI, I am taking the scap lock to prevent deployments, which should not happen in our current capacity configuration
[18:11:33] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[18:11:41] <logmsgbot>	 !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during mw-(api-ext|web) capacity changes - T405955
[18:11:45] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:11:48] <sukhe>	 (nice job on using that! we sometimes forget)
[18:11:48] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[18:11:48] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:11:48] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[18:11:49] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T407520
[18:11:52] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[18:11:57] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:13:31] <swfrench-wmf>	 !log disable-puppet on A:cp hosts for ATS Lua config change - T405955
[18:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:22] <wikibugs>	 (CR) Scott French: [C:+2] trafficserver: disable PHP_ENGINE next routing [puppet] - https://gerrit.wikimedia.org/r/1203569 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[18:14:43] <wikibugs>	 (CR) Bvibber: Reduce number of bucketsizes for MediaViewer (group0) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[18:15:03] <wikibugs>	 ops-eqiad, DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072 (RobH) NEW p:Triage→Medium
[18:15:25] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[18:15:41] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7001.wikimedia.org - sukhe@cumin1003"
[18:15:41] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:15:41] <wikibugs>	 (CR) Eric Gardner: [C:+1] StickyHeaders: scroll-margin-top fixes [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: Bvibber)
[18:15:41] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[18:15:45] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[18:15:49] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7001.wikimedia.org
[18:17:24] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073 (RobH) NEW
[18:18:10] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073#11372087 (RobH)
[18:18:20] <wikibugs>	 (CR) Ssingh: "Question though: if the host is pooled, which will be most cases, how do we run this cookbook then? Like in the comment above as by Valent" [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[18:18:39] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org
[18:18:40] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:21:39] <wikibugs>	 SRE, Data-Platform-SRE, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11372098 (Ahoelzl)
[18:22:00] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:22:20] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:22:20] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:22:20] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors
[18:22:21] <swfrench-wmf>	 !log rolling run-puppet-agent on A:cp hosts for ATS Lua config change - T405955
[18:22:24] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors
[18:22:28] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:31] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:28:10] <logmsgbot>	 sukhe@cumin1003 makevm (PID 3555498) is awaiting input
[18:31:25] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:31:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:31:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[18:31:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:31:56] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:31:56] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:31:56] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors
[18:32:00] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors
[18:32:04] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org
[18:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:33:04] <wikibugs>	 (CR) Bearloga: [C:-1] EventStreamConfig: add stream for Growth and Editing team edit rates (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[18:40:37] <wikibugs>	 (PS2) Bvibber: Reduce number of bucketsizes for MediaViewer (labs, group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165)
[18:42:05] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy3002.wikimedia.org
[18:42:22] <sukhe>	 !log manually running decomm cookbook on hcaptcha-proxy3002: host makevm failed, trying again T409860
[18:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:26] <stashbot>	 T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860
[18:43:16] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[18:44:23] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev: roll back horizon version to 2025-06-23-141023 [puppet] - https://gerrit.wikimedia.org/r/1204940
[18:44:25] <wikibugs>	 (CR) Scott French: [C:+2] mw-(api-ext|web): return next to "idle" size [deployment-charts] - https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[18:45:50] <James_F>	 swfrench-wmf: Next up, 8.4? :-)
[18:45:55] <swfrench-wmf>	 hehe
[18:46:09] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:46:20] <wikibugs>	 (Merged) jenkins-bot: mw-(api-ext|web): return next to "idle" size [deployment-charts] - https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[18:46:26] <James_F>	 (We don't even have CI voting for 8.4.)
[18:46:27] <wikibugs>	 (CR) Andrew Bogott: [C:+2] codfw1dev: roll back horizon version to 2025-06-23-141023 [puppet] - https://gerrit.wikimedia.org/r/1204940 (owner: Andrew Bogott)
[18:46:32] <wikibugs>	 (PS1) BCornwall: wmf-debci: Also create man1 dir [docker-images/production-images] - https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003)
[18:48:05] <swfrench-wmf>	 !log zero external traffic on mw-(api-ext|web) next releases - T405955
[18:48:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:09] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:49:04] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:49:05] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy3002.wikimedia.org
[18:49:14] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372191 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for...
[18:49:55] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:50:09] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:50:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:50:31] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:51:55] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:52:03] <wikibugs>	 (CR) BCornwall: "I remember seeing `forbes-bio.org`'s page, incidentally - it was talking about making your way to forbes' lists by having a professional w" [puppet] - https://gerrit.wikimedia.org/r/1204093 (owner: Ncmonitor)
[18:52:09] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:52:15] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:52:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:53:02] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org
[18:53:04] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:53:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:53:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:54:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:54:27] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:55:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:56:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:56:10] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:56:21] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:56:22] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:56:37] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:56:37] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:56:37] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors
[18:56:40] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors
[18:56:57] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410080 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[18:57:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410080 (ops-monitoring-bot) NEW
[18:57:11] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:57:15] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[18:57:26] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[18:57:40] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372229 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[18:57:47] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:58:02] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:58:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:58:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:58:36] <logmsgbot>	 !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during mw-(api-ext|web) capacity changes - T405955 (duration: 46m 54s)
[18:58:39] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:59:32] <swfrench-wmf>	 !log scaled mw-(api-ext|web)/next to "idle" size - T405955
[18:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:05] <jouncebot>	 andre and jeena: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T1900).
[19:00:14] <andre>	 jouncebot: no!
[19:00:21] <jeena>	 🤣
[19:13:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:14:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:17:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372271 (VRiley-WMF)
[19:17:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410080#11372273 (VRiley-WMF) →Duplicate dup:T410041
[19:20:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520
[19:20:10] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[19:22:43] <wikibugs>	 (CR) CDobbins: "I'd assumed that it was failing due to the combination of the dry-run flag and the fact that this requires hosts to be depooled. If that's" [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[19:25:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520
[19:25:14] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[19:27:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:54] <wikibugs>	 (CR) Bking: [C:+2] "ACK, we have changed the limitranges as well (ref Ic99ed2f2acf98d2be7723253821697525a46869f ), this will apply to the defaults as you said" [deployment-charts] - https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) (owner: Bking)
[19:29:08] <wikibugs>	 (PS1) Scott French: deployment_server: migrate mw-experimental to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1204945 (https://phabricator.wikimedia.org/T405955)
[19:29:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:35:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:37:59] <wikibugs>	 (CR) Ssingh: "You are right, my bad. Confirming:" [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[19:40:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:41:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:47:13] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning
[19:47:28] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11372349 (ops-monitoring-bot) Start pool of db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning - marostegui@cumin1003
[19:47:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372350 (VRiley-WMF) a:VRiley-WMF
[19:47:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372351 (VRiley-WMF) Open→Resolved This is a duplicate.
[19:48:34] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[19:48:34] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org
[19:48:40] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372357 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[19:51:08] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372373 (ssingh) `hcaptcha-proxy3001` worked just fine but `hcaptcha-proxy3002` does not come...
[19:52:14] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy3002.wikimedia.org
[19:54:52] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[19:55:03] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[19:56:15] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[19:59:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372396 (Jclark-ctr) https://netbox.wikimedia.org/dcim/cables/10192/  https://netbox.wikimedia.org/dcim/cables/10191/ These two are for lswtest-eqiad for those can be ignored for...
[19:59:40] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003"
[19:59:45] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003"
[19:59:45] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:59:46] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy3002.wikimedia.org
[19:59:56] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11372397 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for...
[20:00:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372399 (Jclark-ctr)
[20:00:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372401 (Jclark-ctr)
[20:02:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372406 (Jclark-ctr) a:Jclark-ctr
[20:02:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops: netbox cable report cleanup: unterminated cable ends - https://phabricator.wikimedia.org/T410072#11372411 (Jclark-ctr) Open→Resolved https://netbox.wikimedia.org/dcim/cables/1169/ was from T407008 xe-3/1/5  removed from netbox
[20:16:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:17:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:18:31] <wikibugs>	 ops-eqiad, DC-Ops: Unresponsive management for cephosd1001.mgmt:22 - https://phabricator.wikimedia.org/T410088 (phaultfinder) NEW
[20:22:32] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test
[20:26:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11372503 (Jclark-ctr)
[20:26:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410041#11372506 (Jclark-ctr) →Duplicate dup:T409938
[20:30:13] <wikibugs>	 (CR) Dzahn: "ACK! thank you. I am writing a short plan for the actual failover steps now and will include that.  Actually.. looking now if I can improv" [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)
[20:32:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning
[20:32:44] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1241.eqiad.wmnet onto db1262.eqiad.wmnet
[20:32:48] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11372525 (ops-monitoring-bot) Completed pool of db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning - marostegui@cumin1003
[20:32:52] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11372526 (ops-monitoring-bot) Finished cloning db1241.eqiad.wmnet to db1262.eqiad.wmnet - marostegui@cumin1003
[20:38:33] <wikibugs>	 (CR) Dzahn: "an actual plan: https://phabricator.wikimedia.org/P85324" [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)
[20:45:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11372555 (RobH) a:MoritzMuehlenhoff→LSobanski @LSobanski,  The only two #infrastructure-foundations hosts left to migrate are  >>! In T4...
[20:52:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply wmf-opensearch-search-plugins update - bking@cumin2002 - T407520
[20:52:40] <stashbot>	 T407520: Deploy various plugins to fix various things - https://phabricator.wikimedia.org/T407520
[20:56:00] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11372572 (EdErhart-WMF) Hey folks, coming from the YoW team - I am speaking with an imperfect understanding of all the concerns involved, but would 25years.wikip...
[20:57:05] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: Gergő Tisza)
[20:58:01] <wikibugs>	 (PS1) DLynch: Editcheck: flag suggestions when logging actions [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170)
[20:58:13] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) (owner: DLynch)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2100).
[21:00:04] <jouncebot>	 bvibber and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] <Kemayo>	 o/
[21:00:27] <bvibber>	 o/
[21:01:10] <Kemayo>	 I can deploy mine, or it's a small instrumentation change that'd be fine to just throw in with other patches.
[21:01:20] <bvibber>	 ok i'm logged into spiderpig
[21:01:31] <bvibber>	 i can do em all together, that should be fine :D
[21:01:36] <Kemayo>	 Works for me!
[21:01:37] <bvibber>	 yours look sice and clean
[21:01:41] <bvibber>	 *nice
[21:01:43] <bvibber>	 i can't type today :D
[21:02:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: Bvibber)
[21:02:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[21:02:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) (owner: DLynch)
[21:02:28] <Kemayo>	 Thus the beauty of spiderpig saving us from flubbing all those shell commands. 🤩
[21:03:13] <wikibugs>	 (Merged) jenkins-bot: Reduce number of bucketsizes for MediaViewer (labs, group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[21:03:21] <bvibber>	 hehehe exactly
[21:03:42] <bvibber>	 i let someone talk me into deploying from a bar once. never again ;)
[21:05:06] <Kemayo>	 At my previous job someone once showed off by deploying while riding the Space Mountain rollercoaster at Disneyland...
[21:06:09] <wikibugs>	 (Merged) jenkins-bot: StickyHeaders: scroll-margin-top fixes [extensions/ReaderExperiments] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204928 (https://phabricator.wikimedia.org/T409349) (owner: Bvibber)
[21:06:29] <wikibugs>	 (PS4) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807)
[21:06:52] <wikibugs>	 (CR) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[21:07:05] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[21:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:12:52] <wikibugs>	 (Merged) jenkins-bot: Editcheck: flag suggestions when logging actions [extensions/VisualEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204957 (https://phabricator.wikimedia.org/T407170) (owner: DLynch)
[21:13:15] <logmsgbot>	 !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1204928|StickyHeaders: scroll-margin-top fixes (T409349)]], [[gerrit:1204700|Reduce number of bucketsizes for MediaViewer (labs, group0) (T372165)]], [[gerrit:1204957|Editcheck: flag suggestions when logging actions (T407170)]]
[21:13:22] <stashbot>	 T409349: StickyHeaders: legacy parser h3-6 section links obscure content - https://phabricator.wikimedia.org/T409349
[21:13:22] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:13:23] <stashbot>	 T407170: Create Superset dashboard to see how edit suggestions are performing (overall and by type) - https://phabricator.wikimedia.org/T407170
[21:13:50] <James_F>	 Glad CI for wmf/* worked nice and smoothly and no-one noticed we switched from PHP 8.1 to 8.3. :-)
[21:14:06] <mutante>	 🔥
[21:15:17] <James_F>	 (Dropping PHP 8.1 from dev branch, and maybe REL1_45, coming Soon™.)
[21:15:26] <logmsgbot>	 !log bvibber@deploy2002 bvibber, kemayo: Backport for [[gerrit:1204928|StickyHeaders: scroll-margin-top fixes (T409349)]], [[gerrit:1204700|Reduce number of bucketsizes for MediaViewer (labs, group0) (T372165)]], [[gerrit:1204957|Editcheck: flag suggestions when logging actions (T407170)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:16:15] <bvibber>	 Kemayo: can yours be tested on test servers?
[21:16:17] <bvibber>	 mine look good
[21:16:33] <bvibber>	 James_F: wooooooo
[21:16:43] <Kemayo>	 bvibber: Yes, and I just tested it and it seems fine.
[21:16:48] <bvibber>	 awesome
[21:16:52] <logmsgbot>	 !log bvibber@deploy2002 bvibber, kemayo: Continuing with sync
[21:17:42] <James_F>	 Now I just need to work out how to convince Brooke to deploy from a bar again.
[21:17:52] <bvibber>	 step 1: buy me some beers
[21:17:59] <bvibber>	 maybe in milan ;)
[21:18:01] <James_F>	 Step 0: Go to somewhere Brooke is. :-)
[21:18:06] <James_F>	 Oooh, yes, Milan will be fun.
[21:18:16] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[21:18:23] <bvibber>	 i've been out of circulation for a while, it'll be good to get to a hackathon again :)
[21:18:23] * James_F is considering getting the train to Milan from maybe Paris, just because trains are fun.
[21:18:26] <James_F>	 <3
[21:19:01] <taavi>	 did i hear trains
[21:19:21] <James_F>	 taavi: https://www.trenitalia.com/en/frecce/frecciarossa.html
[21:19:36] <mutante>	 do the Thalys. Paris - Bruxelles - Cologne  - oh nooo.. whaaat.. Wikipedia article uses "was"   https://en.wikipedia.org/wiki/Thalys
[21:20:22] <taavi>	 that's called eurostar these days, not to be confused with the old eurostar
[21:20:25] <James_F>	 mutante: If I start from London it'd be nice to only change once. LON -> BRU -> CGN -> … is a bit much.
[21:20:39] <James_F>	 mutante: As opposed to LON -> PAR -> MLN.
[21:20:41] <mutante>	 "Eurostar" then 
[21:20:54] <taavi>	 James_F: unfortunately the hackathon is happening on the one weekend next year where I have a conflict and won't be able to make it :(
[21:21:10] <James_F>	 taavi: Boooooo. Will you at least make it to Paris for Wikimania?
[21:21:20] <James_F>	 All these plans to get things done.
[21:21:50] <taavi>	 I hope to, but can't say for sure yet
[21:22:13] <logmsgbot>	 !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204928|StickyHeaders: scroll-margin-top fixes (T409349)]], [[gerrit:1204700|Reduce number of bucketsizes for MediaViewer (labs, group0) (T372165)]], [[gerrit:1204957|Editcheck: flag suggestions when logging actions (T407170)]] (duration: 08m 58s)
[21:22:14] <bvibber>	 no direct flights to italy from portland, i'll have to change planes and/or trains somewhere. might come up with something clever :)
[21:22:16] <mutante>	 you should name mediawiki releases/sprints after famous trains
[21:22:20] <stashbot>	 T409349: StickyHeaders: legacy parser h3-6 section links obscure content - https://phabricator.wikimedia.org/T409349
[21:22:20] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:22:20] <stashbot>	 T407170: Create Superset dashboard to see how edit suggestions are performing (overall and by type) - https://phabricator.wikimedia.org/T407170
[21:22:21] <bvibber>	 Kemayo: done!
[21:22:29] <Kemayo>	 bvibber: Thanks!
[21:22:54] <kostajh>	 I have a config patch to sync, when you are done
[21:23:39] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[21:23:43] <mutante>	 bvibber: Portland -> Burbank -> (robot taxi) -> LAX -> Rome  
[21:24:33] <bvibber>	 kostajh: all rady, you need me to run it?
[21:24:38] <mutante>	 well, or just via LAX.. just saying Burbank because it's so much smaller
[21:24:52] <bvibber>	 *ready
[21:26:31] <kostajh>	 bvibber: I can sync it, thanks
[21:26:34] <bvibber>	 cool
[21:27:01] <bvibber>	 mutante: lax is pretty far out of the way; great circle route from Portland to Milan says better places to change planes are ... Iceland or London :D
[21:27:25] <bvibber>	 if the iceland seasonal is there i should totally book that
[21:28:35] <mutante>	 bvibber: oh, yea. then maybe check if you can get 23h50m layover. if it's under 24 hours it is still one ticket. but a full night and day to see something, sleep and then continue travel can be so nice
[21:29:00] <wikibugs>	 (PS1) Scott French: Deploy known-client rate limits and multi-select fixes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1204963
[21:31:32] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1204965
[21:31:34] <wikibugs>	 (CR) Scott French: [V:+2] "Tested locally at `3569d73b27557b50a12f73287a7a139ccae0f4ec`." [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1204963 (owner: Scott French)
[21:32:08] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] Deploy known-client rate limits and multi-select fixes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1204963 (owner: Scott French)
[21:32:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:33:11] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002"
[21:33:13] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002
[21:34:02] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002
[21:34:04] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: known-client rate limits and multi-select fixes - swfrench@cumin2002"
[21:35:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:35:56] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11372753 (Dzahn) I think we can roughly order the options like this, from easiest / most standard to least recommended / potentially most problematic:  - SOMETHI...
[21:36:51] <kostajh>	 jouncebot: nowandnext
[21:36:52] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2100)
[21:36:52] <jouncebot>	 In 0 hour(s) and 23 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2200)
[21:39:37] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits [mediawiki-config] - https://gerrit.wikimedia.org/r/1204966 (https://phabricator.wikimedia.org/T405586)
[21:39:42] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: sleep test
[21:40:04] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[21:43:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204966 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[21:44:11] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits [mediawiki-config] - https://gerrit.wikimedia.org/r/1204966 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[21:44:30] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204966|hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits (T405586)]]
[21:44:34] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[21:46:34] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204966|hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:49:22] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[21:53:26] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204966|hCaptcha: Simplify ConfirmEditTriggersCaptcha logic for API edits (T405586)]] (duration: 08m 56s)
[21:53:31] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251113T2200)
[22:00:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:02:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:07:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:12:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:28:53] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test
[22:31:35] <wikibugs>	 (CR) Scott French: [C:+1] "+1 to moving ahead with this in the interim to keep the action-API migration moving forward." [deployment-charts] - https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[22:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:31:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[22:31:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:57:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: Jdlrobson)
[22:59:06] <wikibugs>	 (Merged) jenkins-bot: Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: Jdlrobson)
[22:59:24] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1200173|Drop references to removed Advanced mobile contribution configuration (T402470)]]
[22:59:29] <stashbot>	 T402470: Remove AMC Outreach code from Mobile - https://phabricator.wikimedia.org/T402470
[23:01:52] <logmsgbot>	 !log catrope@deploy2002 catrope, jdlrobson: Backport for [[gerrit:1200173|Drop references to removed Advanced mobile contribution configuration (T402470)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:04:46] <logmsgbot>	 !log catrope@deploy2002 catrope, jdlrobson: Continuing with sync
[23:08:54] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200173|Drop references to removed Advanced mobile contribution configuration (T402470)]] (duration: 09m 30s)
[23:08:59] <stashbot>	 T402470: Remove AMC Outreach code from Mobile - https://phabricator.wikimedia.org/T402470
[23:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: community_civicrm-cv-job-run.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:21:51] <wikibugs>	 (PS1) Jforrester: Enable embedded Wikifunctions on Wikimania wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683)
[23:22:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683) (owner: Jforrester)
[23:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:30:45] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@5fca3ff]: Upgrade LibreNMS to 25.10.0 - T410039
[23:31:00] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@5fca3ff]: Upgrade LibreNMS to 25.10.0 - T410039 (duration: 00m 15s)
[23:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:39:10] <wikibugs>	 (PS1) Dzahn: releases: control jenkins service by DC name, not host name [puppet] - https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127)
[23:41:08] <wikibugs>	 (PS2) Dzahn: releases: control jenkins service by DC name, not host name [puppet] - https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127)
[23:42:39] <wikibugs>	 (PS1) Dzahn: releases: stop/mask jenkins in eqiad, start/unmask jenkins in codfw [puppet] - https://gerrit.wikimedia.org/r/1204982 (https://phabricator.wikimedia.org/T392127)
[23:47:29] <wikibugs>	 (CR) Dzahn: "> My answer was the we will also need unmask/enable the Jenkins service manually in the new primary. Conversely we will have to mask/disab" [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)