[00:02:19] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1204092 [00:02:23] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204093 [00:02:27] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204094 [00:03:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11364943 (10VRiley-WMF) Checked on this ticket, and the order is processing [00:04:09] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11364944 (10VRiley-WMF) Checked on this ticket, the order is processing [00:08:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T407997)', diff saved to https://phabricator.wikimedia.org/P85240 and previous config saved to /var/cache/conftool/dbconfig/20251112-000857-marostegui.json [00:09:02] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [00:09:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1261.eqiad.wmnet with reason: Maintenance [00:09:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T407997)', diff saved to https://phabricator.wikimedia.org/P85241 and previous config saved to /var/cache/conftool/dbconfig/20251112-000922-marostegui.json [00:13:32] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11364950 (10VRiley-WMF) Juniper reached out to me saying they had the wrong address in their system. Gave them the address. 
[00:16:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T407997)', diff saved to https://phabricator.wikimedia.org/P85242 and previous config saved to /var/cache/conftool/dbconfig/20251112-001604-marostegui.json [00:16:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [00:29:06] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:31:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P85243 and previous config saved to /var/cache/conftool/dbconfig/20251112-003112-marostegui.json [00:32:13] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1204096 [00:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1204096 (owner: 10TrainBranchBot) [00:46:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P85244 and previous config saved to /var/cache/conftool/dbconfig/20251112-004620-marostegui.json [00:50:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1204096 (owner: 10TrainBranchBot) [00:51:11] (03CR) 10Pppery: "All of these except for `thetimespedia.in` and `forbes-bio.org` should clearly go to the paid editing blog post. I have absolutely no idea" [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [01:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T407997)', diff saved to https://phabricator.wikimedia.org/P85245 and previous config saved to /var/cache/conftool/dbconfig/20251112-010128-marostegui.json [01:01:32] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [01:01:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1263.eqiad.wmnet with reason: Maintenance [01:01:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T407997)', diff saved to https://phabricator.wikimedia.org/P85246 and previous config saved to /var/cache/conftool/dbconfig/20251112-010151-marostegui.json [01:06:31] (03PS1) 10Superpes15: Throttle exemption for Edit-a-thon in Hong Kong - 15 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204099 (https://phabricator.wikimedia.org/T409852) [01:08:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1204100 [01:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1204100 (owner: 10TrainBranchBot) [01:08:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T407997)', diff saved to https://phabricator.wikimedia.org/P85247 and previous config saved to /var/cache/conftool/dbconfig/20251112-010828-marostegui.json [01:08:32] T407997: Drop the afl_ip column and the afl_ip_timestamp 
index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [01:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:15:03] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 14s) [01:23:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P85248 and previous config saved to /var/cache/conftool/dbconfig/20251112-012335-marostegui.json [01:29:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1204100 (owner: 10TrainBranchBot) [01:38:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P85249 and previous config saved to /var/cache/conftool/dbconfig/20251112-013843-marostegui.json [01:53:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T407997)', diff saved to https://phabricator.wikimedia.org/P85250 and previous config saved to /var/cache/conftool/dbconfig/20251112-015351-marostegui.json [01:53:55] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [01:54:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:48:22] (03CR) 10Robertsky: [C:03+1] Throttle exemption for Edit-a-thon in Hong Kong - 15 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204099 (https://phabricator.wikimedia.org/T409852) (owner: 10Superpes15) [02:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 
is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:54:59] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!! I just added some non-blocking comments." [puppet] - 10https://gerrit.wikimedia.org/r/1203845 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [03:03:22] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:21:58] RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:32:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86654536 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:33:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... 
[03:33:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation
[03:34:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 13568 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:35:26] fceratto@cumin1002 clone (PID 3274848) is awaiting input
[03:37:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[03:38:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ...
[03:38:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:42:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:18:56] (03PS5) 10KartikMistry: machinetranslation: Increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) [05:29:25] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11365067 (10Chandra-WMDE) [05:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:21:09] (03PS1) 10Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204108 
(https://phabricator.wikimedia.org/T406179) [06:36:07] (03PS2) 10KartikMistry: Apertium: Update to 2025-11-10-034557-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203644 (https://phabricator.wikimedia.org/T408515) [06:46:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2212 with weight 0 T409255', diff saved to https://phabricator.wikimedia.org/P85251 and previous config saved to /var/cache/conftool/dbconfig/20251112-064643-marostegui.json [06:46:48] T409255: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T409255 [06:47:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T409255 [06:47:28] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1201900 (https://phabricator.wikimedia.org/T409255) (owner: 10Gerrit maintenance bot) [06:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:52:48] !log Starting s1 codfw failover from db2203 to db2212 - T409255 [06:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:52] T409255: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T409255 [06:53:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s1 codfw as read-only for maintenance - T409255', diff saved to https://phabricator.wikimedia.org/P85252 and previous config saved to /var/cache/conftool/dbconfig/20251112-065259-marostegui.json [06:53:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2212 to s1 primary and set section read-write T409255', diff saved to https://phabricator.wikimedia.org/P85253 and previous config saved to 
/var/cache/conftool/dbconfig/20251112-065321-marostegui.json [06:53:40] (03CR) 10Marostegui: [C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1201901 (https://phabricator.wikimedia.org/T409255) (owner: 10Gerrit maintenance bot) [06:53:45] !log marostegui@dns1006 START - running authdns-update [06:54:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2203 T409255', diff saved to https://phabricator.wikimedia.org/P85254 and previous config saved to /var/cache/conftool/dbconfig/20251112-065426-marostegui.json [06:54:41] !log marostegui@dns1006 END - running authdns-update [06:55:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:56:21] (03PS1) 10Marostegui: db2203: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1204111 (https://phabricator.wikimedia.org/T407463) [06:56:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1039 gradually with 4 steps - Pool es1039.eqiad.wmnet in after cloning [06:57:16] (03CR) 10Marostegui: [C:03+2] db2203: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1204111 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [06:58:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2203.codfw.wmnet with reason: Maintenance [06:58:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2203 - Depool db2203 for migration to mariadb 10.11 [06:58:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2203 - Depool db2203 for migration to mariadb 10.11 [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T0700) [07:03:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance [07:04:06] 
FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2203 gradually with 4 steps - Repooling after upgrade [07:11:05] (03PS1) 10Kosta Harlan: Throttler: Use SecurityLogContext [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204112 [07:13:11] Update Apertium service.. [07:17:49] (03CR) 10KartikMistry: [C:03+2] Apertium: Update to 2025-11-10-034557-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203644 (https://phabricator.wikimedia.org/T408515) (owner: 10KartikMistry) [07:19:27] (03Merged) 10jenkins-bot: Apertium: Update to 2025-11-10-034557-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203644 (https://phabricator.wikimedia.org/T408515) (owner: 10KartikMistry) [07:21:46] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11365193 (10Chandra-WMDE) @AndrewTavis_WMDE I have updated the public key - hope this will work. 
:) [07:21:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [07:22:47] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply [07:23:22] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [07:28:20] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [07:28:59] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [07:34:09] !log Update Apertium to 2025-11-10-034557-production (T408515) [07:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:13] T408515: Update Apertium service to Trixie - https://phabricator.wikimedia.org/T408515 [07:37:02] (03PS1) 10Kosta Harlan: Refactor CaptchaScoreHooks to use EventSubmitter [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204163 (https://phabricator.wikimedia.org/T405597) [07:37:54] (03PS1) 10KartikMistry: Update cxserver to 2025-11-12-072333-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204172 [07:39:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203046 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:40:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204163 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [07:41:58] !log marostegui@cumin1003 
END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1039 gradually with 4 steps - Pool es1039.eqiad.wmnet in after cloning [07:42:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es1039.eqiad.wmnet onto es1033.eqiad.wmnet [07:47:54] (03PS1) 10Kosta Harlan: hCaptcha instrumentation: Handle hcaptcha.render events during edits [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204238 (https://phabricator.wikimedia.org/T409701) [07:48:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204238 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [07:48:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204112 (owner: 10Kosta Harlan) [07:51:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2203 gradually with 4 steps - Repooling after upgrade [07:56:04] (03PS2) 10Muehlenhoff: Also switch cumin2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1160133 (https://phabricator.wikimedia.org/T389380) [07:56:29] (03PS3) 10Muehlenhoff: Also switch cumin2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1160133 (https://phabricator.wikimedia.org/T389380) [08:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T0800). [08:00:05] Tran, kostajh, dcausse, and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] o/
[08:00:34] 👋
[08:00:44] \o
[08:00:45] hi
[08:00:53] I'll sync my patches at the end
[08:01:36] FYI my patch is very simple and doesn't require any testing, so it can be merged with any other patch, without any issue. The only thing you need to do is run resetAuthenticationThrottle.php (for both enwiki and zhwiki) since there are less than 72 hours between deployment and the event.
[08:02:16] We'll get started with temp accounts deploy
[08:02:20] Alright I can start with my patch? We're deploying temp accounts to some wikis and will need some time to test.
[08:03:38] sounds good
[08:06:08] (PS3) STran: Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691)
[08:06:49] (CR) Tchanders: [C:+1] Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) (owner: STran)
[08:07:20] (CR) TrainBranchBot: [C:+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) (owner: STran)
[08:08:08] (Merged) jenkins-bot: Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) (owner: STran)
[08:09:09] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1203416|Deploy temporary accounts to more large/LQT-unblocked projects (T409691)]]
[08:09:13] T409691: Deploy Temporary accounts to Spanish, Commons, Wikidata and others - https://phabricator.wikimedia.org/T409691
[08:11:43] !log stran@deploy2002 stran: Backport for [[gerrit:1203416|Deploy temporary accounts to more large/LQT-unblocked projects (T409691)]] synced to the
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:12:16] (03PS1) 10Kosta Harlan: CheckUser/UserInfoCard: Enable by default for some privileged groups on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204243 (https://phabricator.wikimedia.org/T409840) [08:14:08] testing temp accounts... [08:14:27] (03PS2) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) [08:16:15] (03PS1) 10Awight: Hide edit one/all checkbox when needed [extensions/Cite] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204244 (https://phabricator.wikimedia.org/T409808) [08:16:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Cite] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204244 (https://phabricator.wikimedia.org/T409808) (owner: 10Awight) [08:16:56] (03PS2) 10Kosta Harlan: CheckUser/UserInfoCard: Enable by default for some privileged groups on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204243 (https://phabricator.wikimedia.org/T409840) [08:18:49] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Hide edit one/all checkbox when needed [extensions/Cite] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204244 (https://phabricator.wikimedia.org/T409808) (owner: 10Awight) [08:20:07] !log stran@deploy2002 stran: Continuing with sync [08:21:49] (03CR) 10Brouberol: Enable an oauth2-proxy for growthbook frontend and api pods (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [08:24:28] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203416|Deploy temporary accounts to more large/LQT-unblocked projects (T409691)]] (duration: 15m 19s) [08:24:32] T409691: Deploy 
Temporary accounts to Spanish, Commons, Wikidata and others - https://phabricator.wikimedia.org/T409691
[08:25:10] (CR) JMeybohm: [C:+1] deployment_server: fully migrate mw-(api-ext|web) to 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203559 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[08:26:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:26:25] Mine's done, feel free to start the next. Thank you for your patience while we were testing 🙇
[08:26:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:27:01] dcausse: do you want to go next? and can you deploy Superpes15 patch as well?
[08:27:17] kostajh: sure
[08:27:30] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[08:28:03] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[08:28:18] Superpes15: o/, would you mind if I deploy both your and my patch at once?
[08:28:25] (CR) JMeybohm: [C:+1] mw-(api-ext|web): return capacity from migration to main [deployment-charts] - https://gerrit.wikimedia.org/r/1203571 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[08:28:30] (PS2) Clément Goubert: trafficserver: action api to rest-gateway group2 100% [puppet] - https://gerrit.wikimedia.org/r/1198937 (https://phabricator.wikimedia.org/T408223)
[08:28:58] dcausse Absolutely! mine is quite simple and doesn't require testing :)
[08:29:31] ok shipping them then :)
[08:29:52] Just remember to run resetAuthenticationThrottle.php at the end for both enwiki and zhwiki :)
[08:29:59] Thanks!
[08:30:02] ack
[08:30:33] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204099 (https://phabricator.wikimedia.org/T409852) (owner: Superpes15)
[08:30:33] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203046 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[08:31:15] (Merged) jenkins-bot: Throttle exemption for Edit-a-thon in Hong Kong - 15 November 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1204099 (https://phabricator.wikimedia.org/T409852) (owner: Superpes15)
[08:31:23] (Merged) jenkins-bot: cirrus: start A/B test on completion with default_sort [mediawiki-config] - https://gerrit.wikimedia.org/r/1203046 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[08:31:54] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1204099|Throttle exemption for Edit-a-thon in Hong Kong - 15 November 2025 (T409852)]], [[gerrit:1203046|cirrus: start A/B test on completion with default_sort (T404858)]]
[08:32:00] T409852: Requesting temporary lift of IP cap for 15/11 edit-a-thon - https://phabricator.wikimedia.org/T409852
[08:32:00] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:33:31] (CR) Elukey: "Yeah I am really sorry Cathal! I didn't realize that I typoed the commit msg :(" [deployment-charts] - https://gerrit.wikimedia.org/r/1203820 (owner: Elukey)
[08:34:13] !log dcausse@deploy2002 dcausse, superpes: Backport for [[gerrit:1204099|Throttle exemption for Edit-a-thon in Hong Kong - 15 November 2025 (T409852)]], [[gerrit:1203046|cirrus: start A/B test on completion with default_sort (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:34:20] testing
[08:37:54] (CR) Clément Goubert: Note that per-route rate limits require Envoy 1.33 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1200090 (owner: Daniel Kinzler)
[08:38:08] (CR) Clément Goubert: [C:+2] trafficserver: action api to rest-gateway group2 100% [puppet] - https://gerrit.wikimedia.org/r/1198937 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[08:40:20] (PS4) Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - https://gerrit.wikimedia.org/r/1200090
[08:40:29] (CR) CI reject: [V:-1] Note that per-route rate limits require Envoy 1.33 [deployment-charts] - https://gerrit.wikimedia.org/r/1200090 (owner: Daniel Kinzler)
[08:40:35] !log dcausse@deploy2002 dcausse, superpes: Continuing with sync
[08:42:17] (PS5) Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - https://gerrit.wikimedia.org/r/1200090
[08:42:24] claime: --^
[08:42:52] duesen: tyvm <3
[08:43:06] duesen: gotpl whitespace is finicky :P
[08:44:51] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204099|Throttle exemption for Edit-a-thon in Hong Kong - 15 November 2025 (T409852)]], [[gerrit:1203046|cirrus: start A/B test on completion with default_sort (T404858)]] (duration: 12m 57s)
[08:44:57] T409852: Requesting temporary lift of IP cap for 15/11 edit-a-thon - https://phabricator.wikimedia.org/T409852
[08:44:57] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:45:02] (CR) Gkyziridis: [C:+1] "Thnx for deploying!" [deployment-charts] - https://gerrit.wikimedia.org/r/1204108 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[08:45:04] dcausse: I'll start my pathces now, ok?
[08:45:08] *patches
[08:45:23] kostajh: yes
[08:46:33] !log dcausse@deploy2002 mwscript-k8s job started: resetAuthenticationThrottle.php zhwiki # T409852
[08:46:46] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11365310 (fgiunchedi)
[08:47:12] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204238 (https://phabricator.wikimedia.org/T409701) (owner: Kosta Harlan)
[08:48:18] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7604/co" [puppet] - https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: Filippo Giunchedi)
[08:48:23] Superpes15: I'm not very familiar with resetAuthenticationThrottle.php what options do I need to pass for your patch?
[08:48:37] (03Merged) 10jenkins-bot: hCaptcha instrumentation: Handle hcaptcha.render events during edits [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204238 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [08:49:09] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204238|hCaptcha instrumentation: Handle hcaptcha.render events during edits (T409701 T409703 T409415)]] [08:49:16] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [08:49:16] T409703: hCaptcha: Log challenge close and expiry events to VisualEditorFeatureUse - https://phabricator.wikimedia.org/T409703 [08:49:17] T409415: hCaptcha: Track events for edits in Prometheus - https://phabricator.wikimedia.org/T409415 [08:49:57] (03CR) 10Filippo Giunchedi: [V:03+1] "Now in the description of https://phabricator.wikimedia.org/T399180 including recovery if an host doesn't come back" [puppet] - 10https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:51:23] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204238|hCaptcha instrumentation: Handle hcaptcha.render events during edits (T409701 T409703 T409415)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:52:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893 (10MGerlach) 03NEW [08:52:58] !log kharlan@deploy2002 kharlan: Continuing with sync [08:53:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11365347 (10MGerlach) Hi. we have a new formal collaborator with the Research Team: @AnkitaM. They need access to the stat machines for a new research project. Let me know if... 
[08:53:39] !log dcausse@deploy2002 mwscript-k8s job started: resetAuthenticationThrottle.php zhwiki --signup --ip=1.2.3.4 # resetting throttle cache for T409852 [08:53:43] T409852: Requesting temporary lift of IP cap for 15/11 edit-a-thon - https://phabricator.wikimedia.org/T409852 [08:54:33] (03PS1) 10Marostegui: es1033: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1204247 (https://phabricator.wikimedia.org/T409257) [08:54:33] not running on enwiki since it does not matter according to https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [08:55:18] (03CR) 10Marostegui: [C:03+2] es1033: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1204247 (https://phabricator.wikimedia.org/T409257) (owner: 10Marostegui) [08:55:26] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy revertrisk-wikidata to the revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204108 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [08:56:20] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:56:40] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894 (10MGerlach) 03NEW [08:57:06] (03Merged) 10jenkins-bot: ml-services: deploy revertrisk-wikidata to the revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204108 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [08:57:51] andre: I still have a few more backports to do during this window. Is it ok to start the train a little later? [08:57:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:58:14] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:58:27] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204238|hCaptcha instrumentation: Handle hcaptcha.render events during edits (T409701 T409703 T409415)]] (duration: 09m 17s) [08:58:32] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [08:58:33] T409703: hCaptcha: Log challenge close and expiry events to VisualEditorFeatureUse - https://phabricator.wikimedia.org/T409703 [08:58:33] T409415: hCaptcha: Track events for edits in Prometheus - https://phabricator.wikimedia.org/T409415 [08:59:12] kostajh, the train is blocked anyway, see wikitech-l@ [08:59:20] right [08:59:21] ok [08:59:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 1%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85264 and previous config saved to /var/cache/conftool/dbconfig/20251112-085941-root.json [08:59:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204088 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [08:59:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204163 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [09:00:05] andre and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T0900). [09:00:08] (03CR) 10Elukey: [C:03+1] "LGTM! We'll have to remember it if we'll buy other jumbo hosts, hopefully not soon :D" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1204072 (owner: 10Cathal Mooney) [09:02:24] Yep thanks dcausse for your assistance :3 Sorry I just saw your msg! 
Seems you run resetAuthenticationThrottle.php without indicating the ip [09:02:41] (03CR) 10Elukey: [C:03+1] Also switch cumin2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1160133 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:03:01] (03CR) 10Elukey: [C:03+1] Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:05:57] Superpes15: I ran with --ip 1.2.3.4 (somehow I thought it did not matter for this case), do I need to run it with the range, like this: mwscript-k8s --comment="resetting throttle cache for T409852" --follow --sal -- resetAuthenticationThrottle.php zhwiki --signup --ip=103.108.250.0/24 [09:05:58] T409852: Requesting temporary lift of IP cap for 15/11 edit-a-thon - https://phabricator.wikimedia.org/T409852 [09:06:22] Yep it should be --ip=103.108.250.0/24 dcausse I was just checking in the code if it supports the range [09:06:30] ack, running [09:06:37] (03CR) 10Clément Goubert: [C:03+2] Note that per-route rate limits require Envoy 1.33 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 (owner: 10Daniel Kinzler) [09:06:48] !log dcausse@deploy2002 mwscript-k8s job started: resetAuthenticationThrottle.php zhwiki --signup --ip=103.108.250.0/24 # resetting throttle cache for T409852 [09:08:04] Otherwise, if you get an error trying to run the script with a range, just don't run it.. 
there shouldn't be any issue since the event is scheduled in 3 days (even if less than 72 hours) [09:08:48] (03Merged) 10jenkins-bot: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 (owner: 10Daniel Kinzler) [09:09:02] (03PS1) 10Clément Goubert: rest-gateway: Fix shadow mode in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204249 [09:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:14] Superpes15: it worked just fine, not running on enwiki since I understand it does not matter which wiki it uses, but please let me know if you think otherwise [09:09:29] (03CR) 10Clément Goubert: [C:03+2] site.pp: Add new wikikube insetup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert) [09:09:58] Yep exactly! No need to run on enwiki too dcausse :) [09:10:01] Many thanks [09:10:25] yw! thanks for the help on this script, til! :) [09:10:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11365411 (10Clement_Goubert) a:05Clement_Goubert→03Jhancock.wm Puppet updated [09:10:43] And sorry for the late reply! [09:10:47] np! 
[09:10:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11365413 (10Clement_Goubert) a:03Jhancock.wm Puppet updated [09:11:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11365415 (10Clement_Goubert) a:05Clement_Goubert→03Jhancock.wm Puppet updated [09:11:31] (03PS1) 10Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revision-models ns staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204250 (https://phabricator.wikimedia.org/T406179) [09:11:49] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:12:00] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:12:46] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:12:50] (03Merged) 10jenkins-bot: ext.confirmEdit.hCaptcha: Consider action=submit an edit interface [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204088 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [09:12:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11365423 (10Clement_Goubert) Puppet updated, but we've got some work to do so the hosts can be racked in E/F (see https://phabricator.wikimedia.org/T405... [09:12:53] (03Merged) 10jenkins-bot: Refactor CaptchaScoreHooks to use EventSubmitter [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204163 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [09:12:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:13:18] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review George." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1204108 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [09:13:28] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204088|ext.confirmEdit.hCaptcha: Consider action=submit an edit interface (T409701 T409703 T409415)]], [[gerrit:1204163|Refactor CaptchaScoreHooks to use EventSubmitter (T405597)]] [09:13:37] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [09:13:37] T409703: hCaptcha: Log challenge close and expiry events to VisualEditorFeatureUse - https://phabricator.wikimedia.org/T409703 [09:13:38] T409415: hCaptcha: Track events for edits in Prometheus - https://phabricator.wikimedia.org/T409415 [09:13:38] T405597: hCaptcha: Update instrumentation for risk score - https://phabricator.wikimedia.org/T405597 [09:14:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 2%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85265 and previous config saved to /var/cache/conftool/dbconfig/20251112-091447-root.json [09:15:48] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204088|ext.confirmEdit.hCaptcha: Consider action=submit an edit interface (T409701 T409703 T409415)]], [[gerrit:1204163|Refactor CaptchaScoreHooks to use EventSubmitter (T405597)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:17:02] (03PS4) 10Tiziano Fogli: metamonitoring: add icinga module [puppet] - 10https://gerrit.wikimedia.org/r/1203845 (https://phabricator.wikimedia.org/T397003) [09:17:20] (03CR) 10Clément Goubert: [C:03+1] "LGTM for only replica bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry) [09:18:33] !log kharlan@deploy2002 kharlan: Continuing with sync [09:21:52] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Restart [09:22:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11365454 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=efc351b5-c9f2-4bac-808a-7ec10adef598) set by jynus@cumin1003 for 2:00:00 on 1... [09:22:49] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204088|ext.confirmEdit.hCaptcha: Consider action=submit an edit interface (T409701 T409703 T409415)]], [[gerrit:1204163|Refactor CaptchaScoreHooks to use EventSubmitter (T405597)]] (duration: 09m 21s) [09:22:56] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [09:22:56] T409703: hCaptcha: Log challenge close and expiry events to VisualEditorFeatureUse - https://phabricator.wikimedia.org/T409703 [09:22:57] T409415: hCaptcha: Track events for edits in Prometheus - https://phabricator.wikimedia.org/T409415 [09:22:57] T405597: hCaptcha: Update instrumentation for risk score - https://phabricator.wikimedia.org/T405597 [09:25:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [09:26:41] (03PS5) 10Tiziano Fogli: metamonitoring: add icinga module [puppet] - 
10https://gerrit.wikimedia.org/r/1203845 (https://phabricator.wikimedia.org/T397003) [09:26:58] (03Merged) 10jenkins-bot: EventLogging: Register mediawiki.hcaptcha.risk_score stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [09:27:27] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203829|EventLogging: Register mediawiki.hcaptcha.risk_score stream (T405597)]] [09:27:36] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:29:06] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:29:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 3%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85266 and previous config saved to /var/cache/conftool/dbconfig/20251112-092953-root.json [09:29:59] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1203829|EventLogging: Register mediawiki.hcaptcha.risk_score stream (T405597)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:30:03] T405597: hCaptcha: Update instrumentation for risk score - https://phabricator.wikimedia.org/T405597 [09:30:56] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring: add icinga module (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1203845 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:31:19] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1171.eqiad.wmnet [09:31:19] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1171.eqiad.wmnet [09:32:03] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100) - https://phabricator.wikimedia.org/T409817#11365493 (10Clement_Goubert) @Blake will handle this one [09:32:41] !log kharlan@deploy2002 kharlan: Continuing with sync [09:38:08] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203829|EventLogging: Register mediawiki.hcaptcha.risk_score stream (T405597)]] (duration: 10m 41s) [09:38:12] T405597: hCaptcha: Update instrumentation for risk score - https://phabricator.wikimedia.org/T405597 [09:39:03] (03CR) 10Harroyo-wmf: [C:03+1] hCaptcha: Set fallback for ConfirmEditTriggersCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) (owner: 10Kosta Harlan) [09:39:27] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: MariaDB and kernel upgrade and restart [09:40:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204112 (owner: 10Kosta Harlan) [09:41:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) (owner: 
10Kosta Harlan) [09:44:01] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-serve1012.eqiad.wmnet with reason: manually adjusting host DNS to new IPs ahead of reimage [09:44:19] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [09:44:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 4%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85267 and previous config saved to /var/cache/conftool/dbconfig/20251112-094458-root.json [09:46:53] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100) - https://phabricator.wikimedia.org/T409817#11365534 (10Blake) 05Open→03In progress p:05Triage→03Low a:03Blake [09:49:29] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update IPs for ml-server1012 - cmooney@cumin1003" [09:49:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update IPs for ml-server1012 - cmooney@cumin1003" [09:49:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:50:22] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ml-serve1012.eqiad.wmnet on all recursors [09:50:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve1012.eqiad.wmnet on all recursors [09:54:44] (03Merged) 10jenkins-bot: Throttler: Use SecurityLogContext [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204112 (owner: 10Kosta Harlan) [09:54:49] !log rolling restart of dbprov hosts for mariadb+kernel upgrade [09:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:17] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204112|Throttler: Use SecurityLogContext]] [09:57:31] !log kharlan@deploy2002 kharlan: 
Backport for [[gerrit:1204112|Throttler: Use SecurityLogContext]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:00:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85268 and previous config saved to /var/cache/conftool/dbconfig/20251112-100004-root.json [10:00:39] (03PS1) 10Marostegui: instances.yaml: Add db1264 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1204260 (https://phabricator.wikimedia.org/T407941) [10:01:35] !log kharlan@deploy2002 kharlan: Continuing with sync [10:01:51] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1264 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1204260 (https://phabricator.wikimedia.org/T407941) (owner: 10Marostegui) [10:02:17] PROBLEM - icinga-extmon.wikimedia.org requires authentication on alert1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [10:03:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1264 to x1 depooled T407941', diff saved to https://phabricator.wikimedia.org/P85270 and previous config saved to /var/cache/conftool/dbconfig/20251112-100346-marostegui.json [10:03:51] T407941: Productionize x1 expansion hosts - https://phabricator.wikimedia.org/T407941 [10:05:52] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204112|Throttler: Use SecurityLogContext]] (duration: 10m 35s) [10:07:27] (03PS2) 10Ladsgroup: Revert "pagers: Make history pager work with Postgres" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204258 (https://phabricator.wikimedia.org/T409831) [10:09:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) 
(owner: 10Kosta Harlan) [10:09:47] (03CR) 10Ladsgroup: [C:03+2] Revert "pagers: Make history pager work with Postgres" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204258 (https://phabricator.wikimedia.org/T409831) (owner: 10Ladsgroup) [10:09:59] (03Merged) 10jenkins-bot: hCaptcha: Set fallback for ConfirmEditTriggersCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) (owner: 10Kosta Harlan) [10:10:29] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203841|hCaptcha: Set fallback for ConfirmEditTriggersCaptcha (T409736)]] [10:10:32] Amir1: I'm still finishing up the backport window [10:10:33] T409736: hCaptcha: Adjust ConfirmEditTriggersCaptcha hook in operations/mediawiki-config to implement fallback - https://phabricator.wikimedia.org/T409736 [10:10:40] on the last patch now [10:11:06] this is going to take a while to merge [10:11:25] (it's backport branch) [10:11:48] yep [10:11:58] anyway, I'll hand it over to you after I'm done with this config patch [10:12:50] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1203841|hCaptcha: Set fallback for ConfirmEditTriggersCaptcha (T409736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:14:09] !log kharlan@deploy2002 kharlan: Continuing with sync [10:15:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 6%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85271 and previous config saved to /var/cache/conftool/dbconfig/20251112-101510-root.json [10:18:21] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203841|hCaptcha: Set fallback for ConfirmEditTriggersCaptcha (T409736)]] (duration: 07m 51s) [10:18:25] T409736: hCaptcha: Adjust ConfirmEditTriggersCaptcha hook in operations/mediawiki-config to implement fallback - https://phabricator.wikimedia.org/T409736 [10:19:29] Amir1: ok, over to you [10:19:46] * andre lines up after [10:20:03] thanks! [10:21:14] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203026 (owner: 10PipelineBot) [10:21:37] (03PS2) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202744 [10:22:21] (03PS1) 10Blake: puppet: replace docker-registry stop with systemd mask [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) [10:23:08] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203026 (owner: 10PipelineBot) [10:26:36] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for 8 hosts [10:26:40] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix shadow mode in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204249 (owner: 10Clément Goubert) [10:26:41] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 8 hosts [10:26:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204258 (https://phabricator.wikimedia.org/T409831) (owner: 10Ladsgroup) [10:28:05] (03Merged) 10jenkins-bot: Revert 
"pagers: Make history pager work with Postgres" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204258 (https://phabricator.wikimedia.org/T409831) (owner: 10Ladsgroup) [10:28:36] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1204258|Revert "pagers: Make history pager work with Postgres" (T409831)]] [10:28:39] (03Merged) 10jenkins-bot: rest-gateway: Fix shadow mode in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204249 (owner: 10Clément Goubert) [10:28:39] T409831: Previous link in page history is broken - https://phabricator.wikimedia.org/T409831 [10:30:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 7%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85272 and previous config saved to /var/cache/conftool/dbconfig/20251112-103016-root.json [10:30:29] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable rate limit in shadow mode on some routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203843 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:30:56] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1204258|Revert "pagers: Make history pager work with Postgres" (T409831)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:32:37] (03Merged) 10jenkins-bot: rest-gateway: enable rate limit in shadow mode on some routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203843 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:32:50] (03CR) 10David Caro: [C:03+2] "Deployed in tools, it created a bunch of pending accounts:" [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [10:32:53] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:33:13] (03CR) 10Gkyziridis: [C:03+1] ml-services: deploy revertrisk-wikidata to the revision-models ns staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204250 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [10:34:34] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:34:45] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:35:06] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy revertrisk-wikidata to the revision-models ns staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204250 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [10:35:20] (03PS1) 10Muehlenhoff: Add safe.directory directives for the pwstore repository [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) [10:35:47] (03CR) 10CI reject: [V:04-1] Add safe.directory directives for the pwstore repository [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:36:43] (03Merged) 10jenkins-bot: ml-services: deploy revertrisk-wikidata to the revision-models ns staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204250 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [10:37:10] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204258|Revert "pagers: Make history pager work with 
Postgres" (T409831)]] (duration: 08m 34s) [10:37:14] T409831: Previous link in page history is broken - https://phabricator.wikimedia.org/T409831 [10:37:25] andre: revert backported to wmf.2 [10:37:34] Amir1: Thanks so much! [10:37:46] Alright, just enough time to run the train for me :) [10:38:16] (03PS2) 10Muehlenhoff: Add safe.directory directives for the pwstore repository [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) [10:38:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:39:03] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:39:33] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [10:42:03] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204358 (https://phabricator.wikimedia.org/T408272) [10:42:05] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204358 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [10:42:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:42:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [10:42:56] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204358 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [10:44:15] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:44:19] (03PS3) 10Daniel Kinzler: 
rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) [10:44:23] (03CR) 10CI reject: [V:04-1] rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:45:01] (03CR) 10Michael Große: [C:03+1] [beta] GrowthExperiments: add revise-tone experiment setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202721 (https://phabricator.wikimedia.org/T402707) (owner: 10Sergio Gimeno) [10:45:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 8%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85273 and previous config saved to /var/cache/conftool/dbconfig/20251112-104522-root.json [10:45:54] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:46:03] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:47:57] (03Merged) 10jenkins-bot: rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:48:40] (03CR) 10Majavah: [C:03+1] cloudcephosd: switch 1048 to single interface [puppet] - 10https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [10:49:25] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11365758 (10tappof) I was running some 
tests related to the spike we saw here: https://w.wiki/_mzMp . I basically replaced rate with a combination of deriv and... [10:49:28] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.2 refs T408272 [10:49:32] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [10:50:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:50:44] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:53:13] (03PS1) 10Filippo Giunchedi: pontoon: introduce puppet::hosts function [puppet] - 10https://gerrit.wikimedia.org/r/1204360 (https://phabricator.wikimedia.org/T409905) [10:53:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-wikifunctions - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:53:15] (03PS1) 10Filippo Giunchedi: pontoon: inject netbox metadata for stack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1204361 (https://phabricator.wikimedia.org/T409905) [10:53:22] (03CR) 10Muehlenhoff: "Failing PCC5 check is unrelated and irrelevant" [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:53:55] (03CR) 10CI reject: [V:04-1] pontoon: introduce puppet::hosts function [puppet] - 10https://gerrit.wikimedia.org/r/1204360 (https://phabricator.wikimedia.org/T409905) (owner: 10Filippo Giunchedi) [10:56:31] (03PS2) 10Filippo Giunchedi: pontoon: introduce 
puppet::hosts function [puppet] - 10https://gerrit.wikimedia.org/r/1204360 (https://phabricator.wikimedia.org/T409905) [10:56:31] (03PS2) 10Filippo Giunchedi: pontoon: inject netbox metadata for stack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1204361 (https://phabricator.wikimedia.org/T409905) [10:58:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-wikifunctions - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:58:23] (03CR) 10Elukey: [C:03+1] Add safe.directory directives for the pwstore repository [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1100) [11:00:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 9%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85275 and previous config saved to /var/cache/conftool/dbconfig/20251112-110028-root.json [11:00:52] NOTE: I have to revert the train due to https://phabricator.wikimedia.org/T409876 [11:01:06] which collides with "Deploy window MediaWiki infrastructure" - sorry [11:02:01] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: clean up test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203849 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:02:07] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: clean up test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203849 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:02:10] (03PS2) 10Daniel Kinzler: rest-gateway: clean up test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203849 (https://phabricator.wikimedia.org/T406498) [11:02:11] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: clean up test 
config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203849 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:03:42] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204362 (https://phabricator.wikimedia.org/T408272) [11:03:45] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204362 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [11:03:57] (03Merged) 10jenkins-bot: rest-gateway: clean up test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203849 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:04:04] Reverting the train now, sorry to collide with the MW infra window, we've had many backports today [11:04:07] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:50] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:05:23] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204362 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [11:06:23] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:06:30] oops, accidentally started my window early due to daylight confusion time shenanigans. 
It should interfere with train at all anyway but I'll wait my turn :) [11:06:43] shouldn't* [11:06:54] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:07:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:07:30] thanks [11:07:59] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:08:13] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:12:25] (03CR) 10Muehlenhoff: [C:03+2] Also switch cumin2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1160133 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:13:12] (03PS1) 10Effie Mouzeli: proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 [11:14:49] (03CR) 10CI reject: [V:04-1] proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 (owner: 10Effie Mouzeli) [11:15:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85277 and previous config saved to /var/cache/conftool/dbconfig/20251112-111534-root.json [11:17:16] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.2 refs T408272 [11:17:20] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [11:17:28] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on asw2-c-eqiad,ssw1-d8-eqiad with reason: shutting down one leg of LAG from ssw1-d8-eqiad to asw2-c7-eqiad [11:17:36] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11365890 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e41d36ab-ea9e-437e-a0db-341d018dedf6) set by cmooney@cumin1003 for 2:00:00 on 2 host(s) and their services w... 
[11:18:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-wikifunctions - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:18:17] !log jmm@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cumin2002.codfw.wmnet [11:18:21] I'm done with train for now. Sorry again to run over time [11:18:30] ^ Mvolz (not sure if relevant for you or not :P ) [11:18:56] !log shut down link from ssw1-d8-eqiad ethernet-1/28 <-> asw2-c7-eqiad et-7/0/49 to observe results T409800 [11:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:00] T409800: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800 [11:19:50] (03CR) 10Alexandros Kosiaris: [C:03+1] machinetranslation: Increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry) [11:20:00] (03CR) 10Alexandros Kosiaris: [C:03+1] machinetranslation: Increase replicas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry) [11:22:58] !log will not shut just yet will log again when about to do so T409800 [11:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:46] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cumin2002.codfw.wmnet [11:25:04] !log jmm@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cumin2002.codfw.wmnet [11:26:31] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [11:27:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1202721 (https://phabricator.wikimedia.org/T402707) (owner: 10Sergio Gimeno) [11:30:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 20%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85278 and previous config saved to /var/cache/conftool/dbconfig/20251112-113040-root.json [11:33:18] (03PS1) 10Marostegui: clouddb1022: Initial puppet run [puppet] - 10https://gerrit.wikimedia.org/r/1204364 (https://phabricator.wikimedia.org/T409557) [11:34:15] (03CR) 10Marostegui: "I will depool first clouddb1013 for s3 and then clouddb1016 for x3." [puppet] - 10https://gerrit.wikimedia.org/r/1204364 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [11:36:43] (03PS1) 10Majavah: P:wmcs: replica_cnf_api: Stop checking for envvar existence [puppet] - 10https://gerrit.wikimedia.org/r/1204365 [11:36:47] !log migrated cumin2002 to nftables T389380 [11:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:50] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380 [11:37:52] (03PS1) 10Clément Goubert: rest-gateway: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204366 [11:39:38] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet [11:40:12] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204366 (owner: 10Clément Goubert) [11:41:20] (03PS1) 10Kosta Harlan: hCaptcha instrumentation: Log editor_interface for editAttempStep [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204367 (https://phabricator.wikimedia.org/T409701) [11:41:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cumin2002.codfw.wmnet [11:41:58] (03Merged) 10jenkins-bot: rest-gateway: Bump replicas [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1204366 (owner: 10Clément Goubert) [11:43:22] FIRING: [13x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204367 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [11:43:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:43:50] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:44:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:44:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:44:13] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:44:17] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:45:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the 
systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85279 and previous config saved to /var/cache/conftool/dbconfig/20251112-114545-root.json [11:45:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:47:49] (03PS2) 10KartikMistry: Update cxserver to 2025-11-12-114324-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204172 [11:48:08] (03PS1) 10Muehlenhoff: Enable nftables on cluster::management on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1204368 (https://phabricator.wikimedia.org/T389380) [11:51:17] (03PS1) 10Muehlenhoff: Switch cloudcumin2001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1204369 [11:53:22] FIRING: [13x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204369 (owner: 10Muehlenhoff) [11:54:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:54:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:54:14] (03CR) 10Muehlenhoff: [C:03+2] Add safe.directory directives for the pwstore repository [puppet] - 10https://gerrit.wikimedia.org/r/1204357 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:55:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the 
systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:55:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:57:28] (03PS1) 10Filippo Giunchedi: pontoon: clean puppet certs on host destroy [puppet] - 10https://gerrit.wikimedia.org/r/1204370 (https://phabricator.wikimedia.org/T409912) [11:57:31] Deploying Cxserver. Config only changes. [12:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1200). nyaa~ [12:00:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 30%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85280 and previous config saved to /var/cache/conftool/dbconfig/20251112-120051-root.json [12:01:22] (03CR) 10David Caro: [C:03+1] "Oh yep, I was thinking about this yesterday, but forgot to add it, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1204365 (owner: 10Majavah) [12:05:13] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-11-12-114324-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204172 (owner: 10KartikMistry) [12:05:51] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:06:26] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:07:14] (03Merged) 10jenkins-bot: Update cxserver to 2025-11-12-114324-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204172 (owner: 10KartikMistry) [12:07:59] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:08:25] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:08:49] (03CR) 10Majavah: [C:03+2] P:wmcs: replica_cnf_api: Stop checking for envvar existence 
[puppet] - 10https://gerrit.wikimedia.org/r/1204365 (owner: 10Majavah) [12:09:13] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:09:43] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:14:03] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:14:29] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:14:35] !log shut down link from ssw1-d8-eqiad ethernet-1/28 <-> asw2-c7-eqiad et-7/0/49 to observe results T409800 [12:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:39] T409800: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800 [12:15:50] (03CR) 10Muehlenhoff: "PCC5 failure is expected on *cumin* nodes (uses some P7-specific syntax)" [puppet] - 10https://gerrit.wikimedia.org/r/1204369 (owner: 10Muehlenhoff) [12:15:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 35%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85281 and previous config saved to /var/cache/conftool/dbconfig/20251112-121557-root.json [12:19:47] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:20:19] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:20:41] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:21:17] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:21:33] (03CR) 10Majavah: [C:03+1] Switch cloudcumin2001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1204369 (owner: 10Muehlenhoff) [12:22:57] (03PS2) 10Effie Mouzeli: proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 [12:22:57] (03CR) 10FNegri: [C:03+1] clouddb1022: Initial puppet run [puppet] - 10https://gerrit.wikimedia.org/r/1204364 (https://phabricator.wikimedia.org/T409557) (owner: 
10Marostegui) [12:23:10] (03PS3) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202745 (owner: 10PipelineBot) [12:25:56] (03CR) 10Alexandros Kosiaris: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [12:28:00] (03PS4) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202744 (owner: 10PipelineBot) [12:30:10] !log Updated cxserver to 2025-11-12-114324-production (T408515) [12:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:14] T408515: Update Apertium service to Trixie - https://phabricator.wikimedia.org/T408515 [12:30:51] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202744 (owner: 10PipelineBot) [12:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 40%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85282 and previous config saved to /var/cache/conftool/dbconfig/20251112-123103-root.json [12:36:50] (03PS1) 10Muehlenhoff: Update pwstore docs to point to cumin1003 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1204375 (https://phabricator.wikimedia.org/T389380) [12:37:05] (03CR) 10JMeybohm: [C:03+1] "Thanks, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [12:40:30] (03CR) 10Marostegui: [C:03+2] clouddb1022: Initial puppet run [puppet] - 10https://gerrit.wikimedia.org/r/1204364 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [12:41:01] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3 [12:41:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1022].eqiad.wmnet with reason: Cloning clouddb1022:s3 [12:42:46] (03PS1) 10Majavah: reverse-proxy: Add new eqiad/codfw per-rack subnets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204376 [12:42:46] (03PS1) 10Majavah: Add script to update reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204377 [12:43:22] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test2001.codfw.wmnet [12:46:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 45%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85285 and previous config saved to /var/cache/conftool/dbconfig/20251112-124609-root.json [12:46:39] (03PS1) 10Muehlenhoff: Ganeti: Remove cumin1002 from allow list for RAPI access [puppet] - 10https://gerrit.wikimedia.org/r/1204380 (https://phabricator.wikimedia.org/T389380) [12:47:49] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie [12:48:02] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11366307 (10BTullis) [12:48:15] (03PS1) 10David Caro: maintain-dbusers: add stat for last run [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) [12:48:40] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1204376 (owner: 10Majavah) [12:49:30] jouncebot: nowandnext [12:49:30] For the next 0 hour(s) and 10 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1200) [12:49:30] In 1 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1400) [12:50:17] (03CR) 10CI reject: [V:04-1] maintain-dbusers: add stat for last run [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [12:51:19] (03CR) 10Clément Goubert: [C:03+1] Add script to update reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204377 (owner: 10Majavah) [12:52:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204376 (owner: 10Majavah) [12:52:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204377 (owner: 10Majavah) [12:54:02] (03Merged) 10jenkins-bot: reverse-proxy: Add new eqiad/codfw per-rack subnets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204376 (owner: 10Majavah) [12:54:04] (03Merged) 10jenkins-bot: Add script to update reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204377 (owner: 10Majavah) [12:54:29] (03CR) 10Cathal Mooney: [C:03+1] "I suspect we are better putting our aggregate private ranges for each site here to make any checks against the list more performant, but t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204376 (owner: 10Majavah) [12:54:39] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1204376|reverse-proxy: Add new eqiad/codfw per-rack subnets]], [[gerrit:1204377|Add script to update reverse-proxy.php]] [12:55:09] (03PS2) 10David Caro: maintain-dbusers: add stat for last run [puppet] - 
10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) [12:56:27] (03CR) 10Filippo Giunchedi: [C:03+1] Switch cloudcumin2001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1204369 (owner: 10Muehlenhoff) [12:57:04] !log taavi@deploy2002 taavi: Backport for [[gerrit:1204376|reverse-proxy: Add new eqiad/codfw per-rack subnets]], [[gerrit:1204377|Add script to update reverse-proxy.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:57:06] (03CR) 10CI reject: [V:04-1] maintain-dbusers: add stat for last run [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [12:59:24] !log taavi@deploy2002 taavi: Continuing with sync [13:00:26] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [13:01:14] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#11366383 (10Marostegui) I am going to merge this into {T200306} [13:01:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85286 and previous config saved to /var/cache/conftool/dbconfig/20251112-130115-root.json [13:01:26] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#11366385 (10Marostegui) Sorry I mean into {T409926} [13:03:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [13:04:30] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204376|reverse-proxy: Add new eqiad/codfw per-rack subnets]], [[gerrit:1204377|Add script to update reverse-proxy.php]] (duration: 09m 51s) [13:05:40] 06SRE, 
06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11366397 (10LSobanski) 05Open→03Resolved With the admin email updated, this one should be good to close. [13:08:10] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11366407 (10LSobanski) I just checked and the junk queue is at a reasonable size. We still need to look into a long ter... [13:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:43] (03PS1) 10FNegri: toolsdb: increase innodb_log_file_size to 512M [puppet] - 10https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922) [13:10:21] (03PS3) 10Lucas Werkmeister (WMDE): Enable the MEX / wbui2025 beta feature on testwikidata (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) [13:10:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) (owner: 10Lucas Werkmeister (WMDE)) [13:13:29] (03PS1) 10Marostegui: check_private_data_report: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1204570 [13:13:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11366434 (10Jclark-ctr) 05Open→03Resolved [13:13:56] (03PS1) 10Majavah: Set $wgGlobalBlockingAutoblockExemptions [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1204571 (https://phabricator.wikimedia.org/T409915) [13:14:26] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1204570 (owner: 10Marostegui) [13:14:43] (03CR) 10Elukey: [C:03+2] containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [13:15:25] (03PS1) 10Elukey: Revert "Add MI300X node taints to ml-serve1012" [puppet] - 10https://gerrit.wikimedia.org/r/1204572 [13:16:18] (03CR) 10Elukey: [C:03+2] Revert "Add MI300X node taints to ml-serve1012" [puppet] - 10https://gerrit.wikimedia.org/r/1204572 (owner: 10Elukey) [13:16:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85287 and previous config saved to /var/cache/conftool/dbconfig/20251112-131621-root.json [13:19:42] (03PS1) 10Muehlenhoff: Remove grant from cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/1204574 (https://phabricator.wikimedia.org/T389380) [13:19:59] (03CR) 10Dreamy Jazz: [C:03+1] "Apart from this needing to wait until wmf.3 is deployed everywhere, this LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204571 (https://phabricator.wikimedia.org/T409915) (owner: 10Majavah) [13:23:57] !log installing glib2.0 security updates [13:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:38] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS trixie [13:24:44] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11366458 (10Geagea) see {F70139750} {F70139751} {F70139752} [13:28:06] (03CR) 10Tiziano Fogli: [C:03+2] netbox: enable nrpe2nodexp wrapper on 
check_uncommitted_dns_changes check [puppet] - 10https://gerrit.wikimedia.org/r/1200365 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [13:28:16] (03CR) 10Tiziano Fogli: [C:03+2] nova: enable nrpe2nodexp wrapper on check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1200018 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [13:28:23] (03CR) 10Tiziano Fogli: [C:03+2] neutron: enable nrpe2nodexp wrapper on check-neutron-conntrack [puppet] - 10https://gerrit.wikimedia.org/r/1200016 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [13:28:29] (03CR) 10Tiziano Fogli: [C:03+2] dns: enable nrpe2nodexp wrapper on authdns_update_run check [puppet] - 10https://gerrit.wikimedia.org/r/1200359 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:28:34] (03CR) 10Tiziano Fogli: [C:03+2] dotls: enable nrpe2nodexp wrapper on check_dotls [puppet] - 10https://gerrit.wikimedia.org/r/1200088 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:28:41] (03CR) 10Marostegui: [C:03+1] "We need to remove the grants in our hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1204574 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [13:31:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Moved it to es7', diff saved to https://phabricator.wikimedia.org/P85288 and previous config saved to /var/cache/conftool/dbconfig/20251112-133127-root.json [13:33:53] (03Abandoned) 10Dreamy Jazz: Revert "Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159580 (https://phabricator.wikimedia.org/T397224) (owner: 10Tchanders) [13:40:11] (03PS1) 10David Caro: maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 [13:41:08] (03PS1) 10Kosta Harlan: Support an "always challenge" SiteKey when shouldForceShowCaptcha is enabled [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204576 (https://phabricator.wikimedia.org/T405595) [13:41:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204576 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:41:28] (03CR) 10CI reject: [V:04-1] maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (owner: 10David Caro) [13:42:56] (03CR) 10Elukey: "Ready for a review :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [13:43:08] jouncebot: nowandnext [13:43:08] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [13:43:09] In 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1400) [13:43:16] (03CR) 10Urbanecm: [C:03+2] "Let's ship it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202721 
(https://phabricator.wikimedia.org/T402707) (owner: 10Sergio Gimeno) [13:43:37] (03PS1) 10Muehlenhoff: Revert "Remove SSH key for aarora" [puppet] - 10https://gerrit.wikimedia.org/r/1204578 [13:44:01] (03CR) 10Bartosz Wójtowicz: cassandra: create ml_inference_service Cassandra role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: 10Eevans) [13:44:03] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: add revise-tone experiment setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202721 (https://phabricator.wikimedia.org/T402707) (owner: 10Sergio Gimeno) [13:44:55] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:46:47] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [13:46:58] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T409930 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [13:47:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409930 (10ops-monitoring-bot) 03NEW [13:47:34] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [13:47:46] (03PS1) 10Kosta Harlan: hCaptcha: Define configuration for "always challenge" mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204581 (https://phabricator.wikimedia.org/T405595) [13:48:10] (03CR) 10Elukey: [C:03+1] Ganeti: Remove cumin1002 from allow list 
for RAPI access [puppet] - 10https://gerrit.wikimedia.org/r/1204380 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [13:48:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204581 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:48:31] urbanecm: are you syncing something now? [13:48:49] !log updating cr firewall policy with new caprica definitions, to pick up new clouddb hosts [13:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:59] (03CR) 10Andrew Bogott: [C:03+1] "Looks good. Even if things go poorly, ceph should be fine as long as we only do one at a time." [puppet] - 10https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [13:49:04] kostajh: already finished, it was a beta-only patch [13:49:30] urbanecm: ok, I'll get started on my patches then [13:49:35] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-11-05-063501 to 2025-11-12-122736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204582 (https://phabricator.wikimedia.org/T407718) [13:49:44] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-11-04-215809 to 2025-11-08-223341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204583 (https://phabricator.wikimedia.org/T407791) [13:50:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204367 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [13:50:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204576 
(https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:50:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204581 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:50:50] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve1012 [13:51:35] (03CR) 10David Caro: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (owner: 10David Caro) [13:51:38] (03Merged) 10jenkins-bot: hCaptcha: Define configuration for "always challenge" mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204581 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:52:07] (03CR) 10Tiziano Fogli: "I’d like to test the patch on Pontoon, but I’m getting an error unrelated to the patch itself. I’ll take a look ASAP, so if you can hold o" [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: 10Cwhite) [13:52:33] (03PS2) 10David Caro: maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) [13:52:38] (03Merged) 10jenkins-bot: hCaptcha instrumentation: Log editor_interface for editAttempStep [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204367 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [13:53:04] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:53:21] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11366599 (10Andrew) OSD nodes up through 1034 are scheduled for decom in 2026. Unless there's an urgent port shortage, we should only retcon 1035 and... 
[13:54:34] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [13:55:20] (03PS3) 10David Caro: maintain-dbusers: add stat for last run [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) [13:55:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve1012 [13:59:10] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933 (10Lydia_Pintscher) 03NEW [13:59:30] (03CR) 10David Caro: maintain_dbusers: add basic alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [13:59:50] (03PS3) 10David Caro: maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) [13:59:54] (03CR) 10David Caro: maintain_dbusers: add basic alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1400). [14:00:05] awight, kostajh, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:41] hi, i'm here and syncing my pathes already [14:00:42] I'm here instead of awight. 
[14:00:45] should be done in ~10-15 minutes [14:00:45] o/ [14:00:50] (03Merged) 10jenkins-bot: Support an "always challenge" SiteKey when shouldForceShowCaptcha is enabled [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204576 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [14:01:04] I think we might also want to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1204579 if there’s enough time (unless Dreamy_Jazz disagrees) [14:01:09] but let’s do the scheduled stuff first :) [14:01:12] * Lucas_WMDE waits for kostajh [14:01:23] Definitely agree with backporting that [14:01:28] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204367|hCaptcha instrumentation: Log editor_interface for editAttempStep (T409701)]], [[gerrit:1204576|Support an "always challenge" SiteKey when shouldForceShowCaptcha is enabled (T405595)]], [[gerrit:1204581|hCaptcha: Define configuration for "always challenge" mode (T405595)]] [14:01:28] Just needs a +2 first :D [14:01:32] ack ^^ [14:01:34] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [14:01:35] T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [14:03:26] * Lucas_WMDE looks at Thiemo’s change [14:03:52] ok, looks fine [14:04:19] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit before deployment" [extensions/Cite] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204244 (https://phabricator.wikimedia.org/T409808) (owner: 10Awight) [14:04:30] let’s see if that merges before the current scap finishes ^^ [14:05:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11366677 (10WMDE-leszek) I approve this request on WMDE's end. Thank you! 
[14:06:51] (03CR) 10Eevans: cassandra: create ml_inference_service Cassandra role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: 10Eevans) [14:06:52] wowzers, that’s a lot of warnings in logspam-watch btw [14:06:57] * Lucas_WMDE searches phab [14:07:20] T409910, already UBN [14:07:21] T409910: PHP Warning: foreach() argument must be of type array|object, null given / PHP Warning: Undefined array key "extensionData" - https://phabricator.wikimedia.org/T409910 [14:07:31] * Lucas_WMDE agrees with that assessment [14:07:41] (03PS6) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki, svwiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) [14:07:50] :D always a fun time in our logs [14:08:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [14:08:17] and “Using bool as a message parameter was deprecated in MediaWiki 1.43” is right behind it (albeit at a *much* lower volume), fortunately the extra backport should fix that one [14:08:53] o/ also here for a config change I can self deploy [14:09:33] ack – kostajh is currently deploying and after that I’ll probably do https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/1204244 (unless its gate-and-submit takes longer) [14:09:45] ok [14:09:59] I just thought “that image build is taking a while” and, yeah, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/1204576 touches i18n :S [14:10:27] and it’s not even at the “sleeping for five minutes” stage yet :( [14:13:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2230.codfw.wmnet with 
reason: Clone T400056 [14:13:11] T400056: Core DB testbed on VMs - https://phabricator.wikimedia.org/T400056 [14:14:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db-test2001.codfw.wmnet with reason: Clone T400056 [14:14:30] (03Merged) 10jenkins-bot: Hide edit one/all checkbox when needed [extensions/Cite] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204244 (https://phabricator.wikimedia.org/T409808) (owner: 10Awight) [14:14:43] there it goes [14:17:40] ok the image build is at last at the “waiting 300 seconds” stage [14:17:41] (03PS2) 10Ladsgroup: mysql: Rename cookbooks to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 [14:17:50] so in five minutes, kosta’s deploy can properly begin :S [14:19:06] (03PS1) 10Awight: Revert "Temporarily revoke ssh access for awight" [puppet] - 10https://gerrit.wikimedia.org/r/1204586 [14:19:13] (03CR) 10Awight: [C:03+1] Revert "Temporarily revoke ssh access for awight" [puppet] - 10https://gerrit.wikimedia.org/r/1204586 (owner: 10Awight) [14:20:55] Unfortunate that spiderpig doesn't show any progress information publicly, yet. 
[14:21:15] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Revert "Temporarily revoke ssh access for awight" [puppet] - 10https://gerrit.wikimedia.org/r/1204586 (owner: 10Awight) [14:22:21] (03PS2) 10Blake: puppet: replace docker-registry stop with systemd mask [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) [14:22:39] (03PS1) 10Elukey: admin_ng: add lsw1-e9-eqiad to BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204587 [14:22:50] (03CR) 10CI reject: [V:04-1] puppet: replace docker-registry stop with systemd mask [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [14:22:57] (03CR) 10Blake: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [14:23:23] on the way to test servers now [14:23:51] (03PS3) 10Blake: puppet: replace docker-registry stop with systemd mask [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) [14:25:09] (03CR) 10CI reject: [V:04-1] mysql: Rename cookbooks to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [14:25:40] (03CR) 10Blake: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [14:27:19] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204367|hCaptcha instrumentation: Log editor_interface for editAttempStep (T409701)]], [[gerrit:1204576|Support an "always challenge" SiteKey when shouldForceShowCaptcha is enabled (T405595)]], [[gerrit:1204581|hCaptcha: Define configuration for "always challenge" mode (T405595)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can [14:27:19] now be verified there. 
[14:27:25] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [14:27:25] T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [14:29:29] (03PS1) 10Jforrester: StringForLanguageBuilder: Use LanguageFallbackMode enum [extensions/WikiLambda] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204589 (https://phabricator.wikimedia.org/T409876) [14:29:56] (03PS3) 10Ladsgroup: mysql: Rename cookbooks to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 [14:31:28] (03PS1) 10Lucas Werkmeister (WMDE): BlockErrorFormatter: Convert booleans to string in message params [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204590 (https://phabricator.wikimedia.org/T409810) [14:31:43] (03CR) 10Dreamy Jazz: [C:03+1] BlockErrorFormatter: Convert booleans to string in message params [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204590 (https://phabricator.wikimedia.org/T409810) (owner: 10Lucas Werkmeister (WMDE)) [14:31:54] testing [14:33:11] !log kharlan@deploy2002 kharlan: Continuing with sync [14:33:22] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:06] (03CR) 10CI reject: [V:04-1] mysql: Rename cookbooks to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [14:38:24] (03PS1) 10Itamar Givon: Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) [14:38:24] (03PS4) 10Blake: puppet: replace docker-registry stop with systemd mask [puppet] - 
10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) [14:38:25] (03PS1) 10Itamar Givon: Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) [14:38:27] (03PS1) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) [14:38:33] (03PS1) 10Itamar Givon: Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) [14:38:43] jeez, the real deployment is also taking ages [14:38:54] Finished sync-canaries-k8s (duration: 04m 32s) [14:39:18] I guess it has to download large new images (two of them – PHP8.1 and 8.3) all over the place [14:39:44] (03PS5) 10Blake: puppet: replace docker-registry stop with systemd mask [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) [14:40:25] It wasn't running as slow yesterday [14:40:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204590 (https://phabricator.wikimedia.org/T409810) (owner: 10Lucas Werkmeister (WMDE)) [14:40:43] https://spiderpig.wikimedia.org/jobs/890 shows it taking 3m yesterday [14:40:46] yeah, this is all because the change(s) to backport include i18n [14:40:48] (03CR) 10Blake: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [14:40:48] (03PS2) 10Elukey: admin_ng: add row-e9 to lsw1-e8-eqiad in BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204587 [14:40:53] (I think) [14:40:54] Oh yea [14:41:13] Nowadays we almost need to say to never backport i18n in a window with 
other humans. :-( [14:41:28] feels like it yeah :( [14:41:38] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204587 (owner: 10Elukey) [14:41:44] * Lucas_WMDE is also amazed to see sync-canaries-k8s take almost as long as sync-prod-k8s in https://spiderpig.wikimedia.org/jobs/890 [14:45:37] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1003.eqiad.wmnet [14:45:38] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [14:46:22] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [14:46:32] (03PS3) 10Elukey: admin_ng: add row-e{9,10} to lsw1-e{7,8}-eqiad in BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204587 [14:46:37] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1002.eqiad.wmnet [14:46:55] (03PS1) 10Marostegui: installserver: Do not format clouddb1022 [puppet] - 10https://gerrit.wikimedia.org/r/1204597 [14:46:57] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 3 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T409938 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:47:05] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938 (10ops-monitoring-bot) 03NEW [14:47:07] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1002.eqiad.wmnet [14:47:09] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3 [14:47:12] (03PS6) 10Blake: puppet: add systemd mask to docker-registry stop [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) [14:47:21] 
!log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet [14:47:44] (03CR) 10Clément Goubert: [C:03+2] puppet: add systemd mask to docker-registry stop [puppet] - 10https://gerrit.wikimedia.org/r/1204356 (https://phabricator.wikimedia.org/T409817) (owner: 10Blake) [14:47:47] (03PS1) 10Itamar Givon: Update documentation for rdf_functions.sh path in dumpwikibaserdf.sh [dumps] - 10https://gerrit.wikimedia.org/r/1204598 (https://phabricator.wikimedia.org/T408800) [14:48:36] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204367|hCaptcha instrumentation: Log editor_interface for editAttempStep (T409701)]], [[gerrit:1204576|Support an "always challenge" SiteKey when shouldForceShowCaptcha is enabled (T405595)]], [[gerrit:1204581|hCaptcha: Define configuration for "always challenge" mode (T405595)]] (duration: 47m 08s) [14:48:38] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:48:41] T409701: hCaptcha: Log challenge event as "saveFailure" in EditAttemptStep - https://phabricator.wikimedia.org/T409701 [14:48:42] T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [14:48:42] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1003.eqiad.wmnet [14:48:52] Yes, sorry about that everyone. 
I should have started earlier [14:48:55] done now [14:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:51:20] o_O [14:51:45] (03PS1) 10Elukey: cpufrequtils: add proper exec perms to /usr/libexec/cpupower [puppet] - 10https://gerrit.wikimedia.org/r/1204599 [14:51:47] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1204244|Hide edit one/all checkbox when needed (T409808)]] [14:51:51] T409808: Change all subrefs checkbox wrongly shows up - https://phabricator.wikimedia.org/T409808 [14:52:43] kostajh: you even started early but it wasn’t early enough :( [14:52:46] “Duration 58m 30s” 💀 [14:52:52] that’s brutal [14:53:12] I don’t blame you for not knowing it would be *that* bad ._. [14:53:22] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:36] jouncebot: next [14:53:36] In 0 hour(s) and 6 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1500) [14:53:39] mh [14:53:52] Lucas_WMDE: i18n changes? 
[14:53:59] yes, but even so [14:54:10] my memory of i18n changes is that they take something like 20 or 30 minutes [14:54:18] not one entire window [14:54:22] (03CR) 10Marostegui: [C:03+2] installserver: Do not format clouddb1022 [puppet] - 10https://gerrit.wikimedia.org/r/1204597 (owner: 10Marostegui) [14:54:31] my memory says they're unpredictable [14:54:41] but yeah...we should really do something about that [14:55:44] !log lucaswerkmeister-wmde@deploy2002 awight, lucaswerkmeister-wmde: Backport for [[gerrit:1204244|Hide edit one/all checkbox when needed (T409808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:55:55] Thiemo_WMDE: please test [14:55:58] (if possible) [14:56:37] Done. It's fixed. [14:56:43] !log lucaswerkmeister-wmde@deploy2002 awight, lucaswerkmeister-wmde: Continuing with sync [14:56:44] \o/ [14:56:45] thanks [14:57:23] Our window starts in a few minutes but doesn't touch MW stuff. [14:57:24] 07sre-alert-triage, 06serviceops, 13Patch-For-Review: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100) - https://phabricator.wikimedia.org/T409817#11366931 (10Blake) 05In progress→03Resolved Merged and deployed - thanks @Clement_Goubert! [14:57:28] (03CR) 10AikoChou: [C:03+1] cassandra: create ml_inference_service Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: 10Eevans) [14:57:29] So if there are urgent deploys, please continue. [14:57:53] not sure about urgent but I’d still like to get them deployed [14:58:01] I guess https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1204590 is the most urgent one, fixes some logspam [14:58:20] I mean, we have a train blocker at our end we need to deploy too in MW space. 
[14:58:22] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:31] But our focus at least at first is the services. [14:58:51] I’m guessing the timing of edsanders’ https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133975 was announced to the community as well? [14:59:00] yes [14:59:00] Lucas_WMDE: If you're deploying 1204590 could you do 1204589 at the same time? [14:59:16] *looks* [14:59:27] sure, looks okay [14:59:51] and maybe the change by edsanders too [14:59:59] I feel like those should all be relatively low risk [15:00:00] It's a no-op in actual production (the code is only live on WF.org, and that's not on wmf.2 yet because of this). [15:00:04] Act. [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1500) [15:00:06] Err. [15:00:07] Ack, even. [15:00:08] (whereas my own config change should be separate, that’s already failed once) [15:00:13] If you want to act that'd be great. 
;-) [15:00:16] ^^ [15:00:20] sure, can do [15:00:21] my config change is low risk - we've done this a few times before [15:00:35] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204590 (https://phabricator.wikimedia.org/T409810) (owner: 10Lucas Werkmeister (WMDE)) [15:00:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikiLambda] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204589 (https://phabricator.wikimedia.org/T409876) (owner: 10Jforrester) [15:01:32] jouncebot: nowandnext [15:01:32] For the next 0 hour(s) and 58 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1500) [15:01:32] In 0 hour(s) and 28 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1530) [15:01:58] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11366966 (10Jclark-ctr) [15:02:02] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409930#11366968 (10Jclark-ctr) →14Duplicate dup:03T409938 [15:02:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11366971 (10Jclark-ctr) a:03Jclark-ctr [15:02:08] oh, right, its announcement was up there, I missed it [15:04:16] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204244|Hide edit one/all checkbox when needed (T409808)]] (duration: 12m 30s) [15:04:20] T409808: Change all subrefs checkbox wrongly shows up - https://phabricator.wikimedia.org/T409808 [15:04:44] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1204578 (owner: 10Muehlenhoff) [15:04:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204589 (https://phabricator.wikimedia.org/T409876) (owner: 10Jforrester) [15:04:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204590 (https://phabricator.wikimedia.org/T409810) (owner: 10Lucas Werkmeister (WMDE)) [15:04:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [15:04:58] that might exceed the IRC message length for the !log messages – we’ll see [15:05:06] Yes. [15:05:15] (03Merged) 10jenkins-bot: BlockErrorFormatter: Convert booleans to string in message params [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204590 (https://phabricator.wikimedia.org/T409810) (owner: 10Lucas Werkmeister (WMDE)) [15:05:16] (03CR) 10Cathal Mooney: [C:03+1] admin_ng: add row-e{9,10} to lsw1-e{7,8}-eqiad in BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204587 (owner: 10Elukey) [15:05:40] (03CR) 10Elukey: [C:03+2] admin_ng: add row-e{9,10} to lsw1-e{7,8}-eqiad in BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204587 (owner: 10Elukey) [15:05:41] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements everywhere except enwiki, svwiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [15:06:04] (03Merged) 10jenkins-bot: StringForLanguageBuilder: Use LanguageFallbackMode enum [extensions/WikiLambda] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204589 (https://phabricator.wikimedia.org/T409876) (owner: 10Jforrester) [15:06:43] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport 
for [[gerrit:1204589|StringForLanguageBuilder: Use LanguageFallbackMode enum (T409876)]], [[gerrit:1204590|BlockErrorFormatter: Convert booleans to string in message params (T409810)]], [[gerrit:1133975|Enable DiscussionTools visual enhancements everywhere except enwiki, svwiki and ruwiki (T379264)]] [15:06:48] Dreamy_Jazz: do you know if the block error is testable on mwdebug? [15:06:50] T409876: WikiLambda test failures - https://phabricator.wikimedia.org/T409876 [15:06:50] T409810: PHP Deprecated: Using bool as a message parameter was deprecated in MediaWiki 1.43 - https://phabricator.wikimedia.org/T409810 [15:06:50] T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264 [15:06:55] that message still fit within the length limit, yay [15:07:01] (just barely, I suspect ^^) [15:07:16] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:07:19] I guess it should be testable when trying to edit while blocked [15:07:37] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:07:39] I could try to edit using a VPN which should do that [15:07:41] * Lucas_WMDE looks up how to get yourself blocked on wiki ^U [15:07:44] heh, ok [15:07:52] if you have that set up, sure [15:08:04] otherwise I’d be okay with just syncing this, I think the risk of breakage is low [15:08:06] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2025-11-05-063501 to 2025-11-12-122736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204582 (https://phabricator.wikimedia.org/T407718) (owner: 10Jforrester) [15:08:15] I've found NordVPN usually works well enough to get a block [15:08:22] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:23] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198938 (https://phabricator.wikimedia.org/T408223) [15:08:23] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:08:24] Let me check if I can actually reproduce now [15:08:32] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:09:01] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jforrester, esanders: Backport for [[gerrit:1204589|StringForLanguageBuilder: Use LanguageFallbackMode enum (T409876)]], [[gerrit:1204590|BlockErrorFormatter: Convert booleans to string in message params (T409810)]], [[gerrit:1133975|Enable DiscussionTools visual enhancements everywhere except enwiki, svwiki and ruwiki (T379264)]] synced to the testservers (see [15:09:01] https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:09:11] edsanders: can you test the change? [15:09:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:09:31] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:09:51] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove SSH key for aarora" [puppet] - 10https://gerrit.wikimedia.org/r/1204578 (owner: 10Muehlenhoff) [15:09:55] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-11-05-063501 to 2025-11-12-122736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204582 (https://phabricator.wikimedia.org/T407718) (owner: 10Jforrester) [15:10:09] testing [15:10:15] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:10:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11367024 (10Krd) Numbers appear correct. [15:10:45] lgtm [15:10:50] ack [15:10:54] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:11:05] Dreamy_Jazz, any success? (I don’t see mwdebug deprecations warnings in logstash yet, at least) [15:11:47] Not yet [15:11:53] ok [15:11:53] Realising that the VPN blocks are global [15:11:59] Which don't have the problem [15:12:21] ah [15:12:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11367044 (10Geagea) how pl 5/4 --> 2/1 pt 109/108 --> 11/10 nl - my mistake [15:13:07] I could perform a test block on testwiki for this [15:13:24] I guess… [15:13:27] not sure it’s worth it tbh [15:14:31] (03CR) 10Bartosz Wójtowicz: cassandra: create ml_inference_service Cassandra role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: 10Eevans) [15:15:27] I created a testwiki block of the VPN IP I'm using, but can't seem to reproduce it [15:15:42] as in, it’s not showing you the right block message? 
[15:15:52] or not blocking at all? [15:15:54] It is showing me the message [15:15:59] But no warnings are shown [15:16:05] that’s correct, right? [15:16:10] or do you mean, no warnings even without mwdebug? [15:16:15] Yes [15:16:17] hm [15:16:32] It's possible it was only showing for translation pages? [15:16:41] I guess so [15:16:48] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:16:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1204599 (owner: 10Elukey) [15:16:57] on logstash in the past 30 minutes it only happened on mediawiki.org (42) and wikimania2014.w.o (2) [15:17:04] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jforrester, esanders: Continuing with sync [15:17:06] let’s just deploy [15:17:08] Yup [15:17:11] Reproduced it [15:17:12] Now [15:17:17] ah ok [15:17:22] (03CR) 10Muehlenhoff: [C:03+2] Ganeti: Remove cumin1002 from allow list for RAPI access [puppet] - 10https://gerrit.wikimedia.org/r/1204380 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [15:17:29] yeah there it is [15:17:45] wow, eight messages for one request. that’s spammy [15:17:47] Tried using mw-debug and no logs [15:17:48] !log migrated pwstore repository from cumin1002 to cumin1003 T389380 [15:17:51] yay [15:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:51] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380 [15:17:53] thanks for testing! 
[15:18:01] Np [15:18:04] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:18:13] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:18:15] (03CR) 10Elukey: [C:03+2] cpufrequtils: add proper exec perms to /usr/libexec/cpupower [puppet] - 10https://gerrit.wikimedia.org/r/1204599 (owner: 10Elukey) [15:18:43] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-11-04-215809 to 2025-11-08-223341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204583 (https://phabricator.wikimedia.org/T407791) [15:18:43] (03PS1) 10Jforrester: [WIP] Add Python test call back in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) [15:18:44] and I guess I’ll try deploying my own config change between the xLab and mw infrastructure windows later [15:18:50] looks like the deployment calendar has a nice gap there [15:18:51] Lucas_WMDE: <3 [15:19:07] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:19:56] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-11-04-215809 to 2025-11-08-223341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204583 (https://phabricator.wikimedia.org/T407791) (owner: 10Jforrester) [15:20:40] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198938 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [15:21:19] elukey: go ahead and merge my change [15:21:24] oook [15:21:46] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-11-04-215809 to 2025-11-08-223341 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204583 (https://phabricator.wikimedia.org/T407791) (owner: 10Jforrester) [15:22:17] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:22:32] 10ops-codfw, 06SRE, 06DC-Ops, 
10decommission-hardware: decommission es2031 - https://phabricator.wikimedia.org/T408410#11367117 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:22:46] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:22:56] !log fceratto@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM db-test1001.eqiad.wmnet [15:23:10] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission es2029 - https://phabricator.wikimedia.org/T408408#11367120 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:23:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11367125 (10Jclark-ctr) [15:23:20] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:23:37] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission es2030 - https://phabricator.wikimedia.org/T408409#11367130 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:23:59] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204589|StringForLanguageBuilder: Use LanguageFallbackMode enum (T409876)]], [[gerrit:1204590|BlockErrorFormatter: Convert booleans to string in message params (T409810)]], [[gerrit:1133975|Enable DiscussionTools visual enhancements everywhere except enwiki, svwiki and ruwiki (T379264)]] (duration: 17m 17s) [15:24:06] T409876: WikiLambda's StringForLanguageBuilder relies on LanguageFallback's mode being an int, not the new LanguageFallbackMode enum, breaking views (and tests) - https://phabricator.wikimedia.org/T409876 [15:24:06] T409810: PHP Deprecated: Using bool as a message parameter was deprecated in MediaWiki 1.43 - https://phabricator.wikimedia.org/T409810 [15:24:06] phew [15:24:06] T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264 [15:24:12] !log ecarg@deploy2002 helmfile [codfw] 
DONE helmfile.d/services/wikifunctions: apply [15:24:14] (03PS1) 10Jforrester: wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) [15:24:15] !log UTC afternoon backport+config window done [15:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:17] * Lucas_WMDE done deploying for now [15:24:24] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:24:33] James_F: wikifunctions should be unblocked (IIUC) [15:24:38] edsanders: feature should be live :) [15:24:40] Lucas_WMDE: Thank you so much! [15:25:04] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:25:16] andre: Train should be unblocked w.r.t. T409876 [15:26:20] thanks! [15:26:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11367164 (10Jclark-ctr) Submitted ticket with Dell SR218576742 2x failed drives [15:27:23] James_F, thank you! [15:27:29] 14SRE-Sprint-Week-Sustainability-March2023, 06serviceops, 07Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398#11367168 (10LSobanski) 05Open→03Resolved a:03LSobanski Here are the changes to the documentation that happened sinc... 
[15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1500) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1530) [15:30:41] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists affcomwiki; (T297297) [15:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:45] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [15:31:28] (03PS1) 10Muehlenhoff: Remove cumin1002 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1204609 (https://phabricator.wikimedia.org/T389380) [15:32:25] (03PS1) 10Brouberol: airflow-test-k8s: test the new image with sasl compiled for python3.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204610 (https://phabricator.wikimedia.org/T408711) [15:32:40] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367204 (10ssingh) [15:33:23] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:40] !log Drop cumin2024@cumin1002 from production - T409929 [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:44] T409929: Remove cumin1002 grants from production - https://phabricator.wikimedia.org/T409929 [15:34:32] (03PS2) 10Jforrester: wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) [15:35:18] (03CR) 10Brouberol: 
[C:03+2] airflow-test-k8s: test the new image with sasl compiled for python3.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204610 (https://phabricator.wikimedia.org/T408711) (owner: 10Brouberol) [15:35:53] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM db-test1001.eqiad.wmnet [15:36:02] !log fceratto@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM db-test1002.eqiad.wmnet [15:36:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:38:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:42:49] (03CR) 10Marostegui: [C:03+2] Remove grant from cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/1204574 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [15:42:54] (03PS2) 10Cory Massaro: [WIP] Add Python test call back in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [15:43:42] (03PS3) 10Cory Massaro: Add Python test call back in. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [15:44:08] (03CR) 10Jforrester: Add Python test call back in. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [15:44:58] (03PS1) 10Kamila Součková: mw-cron: enable email_verification_reminder [puppet] - 10https://gerrit.wikimedia.org/r/1204616 [15:45:26] (03PS1) 10Marostegui: production-ms.sql.erb: Remove root@10.64.48.98 [puppet] - 10https://gerrit.wikimedia.org/r/1204617 (https://phabricator.wikimedia.org/T409929) [15:45:50] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367275 (10MoritzMuehlenhoff) Looks good! [15:46:26] (03CR) 10Marostegui: [C:03+2] "This is a NOOP until we remove the grants in production, so fine to merge as I am working on removing the grants now" [puppet] - 10https://gerrit.wikimedia.org/r/1204617 (https://phabricator.wikimedia.org/T409929) (owner: 10Marostegui) [15:47:00] (03CR) 10Marostegui: [V:03+2 C:03+2] production-ms.sql.erb: Remove root@10.64.48.98 [puppet] - 10https://gerrit.wikimedia.org/r/1204617 (https://phabricator.wikimedia.org/T409929) (owner: 10Marostegui) [15:48:26] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367296 (10MoritzMuehlenhoff) One thing to consider, when we actually apply the role on DCs enab... [15:49:16] (03CR) 10Ahmon Dancy: mw-web: Remove the hard-coded k8s version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [15:49:34] (03CR) 10Kamila Součková: "@dreamyjazzwikipedia@gmail.com I noticed that this job was missing, I assume it should be running, is that correct?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [15:49:35] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367302 (10MoritzMuehlenhoff) For eqiad best to use B and D and for codfw best to use C and D [15:49:41] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367303 (10ssingh) >>! In T409860#11367296, @MoritzMuehlenhoff wrote: > One thing to consider, w... [15:50:07] (03Abandoned) 10Herron: logstash: set "index.translog.durability": "async" as template default [puppet] - 10https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [15:51:25] (03CR) 10Elukey: [C:03+1] Remove cumin1002 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1204609 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [15:54:51] (03CR) 10Dreamy Jazz: "Raine, this is already on line 93?" [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [15:56:03] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM db-test1002.eqiad.wmnet [15:56:20] !log fceratto@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM db-test2002.codfw.wmnet [15:57:31] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11367333 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:57:38] (03CR) 10Dreamy Jazz: "Also, the job runs monthly on the 17th. 
The last run was manually done on the 17th of October, so it's a few days before this should be ru" [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [15:58:03] (03CR) 10Kamila Součková: "Oh, you're right, sorry. But I can't find the job in k8s (that's why I thought it was missing). Do you know whether the job has been runni" [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [15:59:09] (03CR) 10Dreamy Jazz: "It is not supposed to run until the 17th. The last run was on the 17th of October and was done manually (not via puppet)." [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [16:01:21] (03CR) 10Dreamy Jazz: "The relevant SAL entry is reported on Phab at https://phabricator.wikimedia.org/T58074#11285719" [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [16:01:55] (03PS2) 10Eevans: cassandra: create ml_inference_service Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) [16:02:18] (03PS3) 10Eevans: cassandra: create revise_tone_task_generator Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) [16:03:22] (03PS4) 10Cory Massaro: Add Python test call back in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [16:03:56] (03PS5) 10Cory Massaro: Add Python test call back in. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [16:04:25] (03CR) 10Cory Massaro: Add Python test call back in. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [16:05:40] (03CR) 10David Caro: maintain_dbusers: add basic alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [16:05:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM db-test2002.codfw.wmnet [16:06:44] (03PS1) 10Clément Goubert: rest-gateway: Disable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204619 [16:08:08] (03PS1) 10Muehlenhoff: Remove cumin1002 from alertmanager access [puppet] - 10https://gerrit.wikimedia.org/r/1204620 (https://phabricator.wikimedia.org/T389380) [16:10:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11367385 (10Jhancock.wm) 05Open→03Resolved @Raine ran into a secondary issue with the backplane, but it's fixed now. let us you know if y... [16:10:27] (03CR) 10Eevans: cassandra: create revise_tone_task_generator Cassandra role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: 10Eevans) [16:10:54] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11367398 (10Krd) There is a subqueue which counts for the headline but not for the queue view. 
[16:11:15] (03PS1) 10Muehlenhoff: Remove cumin1002 as Homer git peer [puppet] - 10https://gerrit.wikimedia.org/r/1204622 (https://phabricator.wikimedia.org/T389380) [16:12:59] (03PS1) 10Majavah: hieradata: Enable jumbo frames on all eqiad1 cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1204623 (https://phabricator.wikimedia.org/T330075) [16:13:01] (03PS1) 10Majavah: hieradata: Enable jumbo frames on eqiad1 cloudnets [puppet] - 10https://gerrit.wikimedia.org/r/1204624 (https://phabricator.wikimedia.org/T330075) [16:13:03] (03PS1) 10Majavah: hieradata: Enable jumbo frames on remaining eqiad1 nodes [puppet] - 10https://gerrit.wikimedia.org/r/1204625 (https://phabricator.wikimedia.org/T330075) [16:13:05] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup feature flag for jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1204626 (https://phabricator.wikimedia.org/T330075) [16:13:07] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 [16:13:20] (03PS1) 10Marostegui: wmf_root_client.pp: Remove cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) [16:16:15] (03CR) 10AikoChou: [C:03+1] cassandra: create revise_tone_task_generator Cassandra role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: 10Eevans) [16:17:07] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [16:18:28] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367420 (10ssingh) `sudo cookbook sre.ganeti.makekevm --vcpus 2 --memory 2 --disk 20 --network p... 
[16:19:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11367421 (10Geagea) my immersion was that it's not ok. But if it's ok then we done. [16:20:33] (03PS7) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [16:20:35] (03PS2) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 [16:20:54] (03CR) 10Kamila Součková: [C:03+1] site.pp and preseed.yaml: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:21:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, but let's not merge while cumin1002 is around and has a role, otherwise this will break Puppet runs." [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui) [16:22:11] (03CR) 10Ssingh: [C:03+2] site.pp and preseed.yaml: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:22:21] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [16:22:22] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [16:22:56] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [16:23:32] (03PS6) 10Jforrester: wikifunctions: Add Python test call back in to test script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) [16:23:37] (03CR) 10Jforrester: [C:03+2] wikifunctions: Add Python test call back in to test script [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [16:25:28] (03Merged) 10jenkins-bot: wikifunctions: Add Python test call back in to test script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204603 (https://phabricator.wikimedia.org/T406848) (owner: 10Jforrester) [16:26:09] !log sudo cumin "O:installserver" "run-puppet-agent" [16:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:31] (03PS3) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 [16:26:55] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:26:59] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet [16:27:34] (03CR) 10Jforrester: "For consideration." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) (owner: 10Jforrester) [16:28:22] 14SRE-Sprint-Week-Sustainability-March2023, 06collaboration-services, 10Phabricator, 06serviceops-radar, and 2 others: Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879#11367463 (10Aklapper) To get closer to this in the long run, "Prevent write queries from... 
[16:28:58] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [16:29:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:30:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Fixing grants [16:30:24] (03PS4) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 [16:30:28] (03CR) 10Elukey: [C:03+1] Remove cumin1002 as Homer git peer [puppet] - 10https://gerrit.wikimedia.org/r/1204622 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [16:32:52] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [16:33:13] (03PS8) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [16:33:40] (03PS5) 10Majavah: P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 [16:33:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:35:27] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy1001.wikimedia.org [16:35:28] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:35:35] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:37:29] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy2001.wikimedia.org [16:38:33] (03PS9) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [16:38:53] !log 
btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:39:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:40:31] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:41:19] jouncebot: nowandnext [16:41:19] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [16:41:19] In 1 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1800) [16:41:32] I’ll try to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1202164 unless someone tells me not to [16:41:33] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy1001.wikimedia.org - sukhe@cumin1003" [16:41:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy1001.wikimedia.org - sukhe@cumin1003" [16:41:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:51] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy1001.wikimedia.org on all recursors [16:41:55] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy1001.wikimedia.org on all recursors [16:42:04] (03CR) 10Marostegui: "Sounds good! 
I will wait for the decommissioning" [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui) [16:42:22] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy1001.wikimedia.org - sukhe@cumin1003" [16:42:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy1001.wikimedia.org - sukhe@cumin1003" [16:42:47] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy1001.wikimedia.org with OS trixie [16:42:54] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... 
[16:43:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) (owner: 10Lucas Werkmeister (WMDE)) [16:43:51] (03Merged) 10jenkins-bot: Enable the MEX / wbui2025 beta feature on testwikidata (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) (owner: 10Lucas Werkmeister (WMDE)) [16:44:00] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy2001.wikimedia.org - sukhe@cumin1003" [16:44:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy2001.wikimedia.org - sukhe@cumin1003" [16:44:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:44:04] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy2001.wikimedia.org on all recursors [16:44:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy2001.wikimedia.org on all recursors [16:44:23] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1202164|Enable the MEX / wbui2025 beta feature on testwikidata (v2) (T407737)]] [16:44:28] T407737: [MEX] Add mobile editing for statments on Test Wikidata - https://phabricator.wikimedia.org/T407737 [16:44:40] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy2001.wikimedia.org - sukhe@cumin1003" [16:44:44] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
hcaptcha-proxy2001.wikimedia.org - sukhe@cumin1003" [16:45:47] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy2001.wikimedia.org with OS trixie [16:45:55] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [16:46:31] (03PS2) 10Bking: Configure reprepro to mirror upstream opensearch2 and opensearch3 repos [puppet] - 10https://gerrit.wikimedia.org/r/1196949 (https://phabricator.wikimedia.org/T407123) (owner: 10Btullis) [16:47:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1202164|Enable the MEX / wbui2025 beta feature on testwikidata (v2) (T407737)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:47:07] testing… [16:47:07] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy1002.wikimedia.org [16:47:09] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:47:18] (03PS2) 10Kamila Součková: mw-cron: enable email_verification_reminder [puppet] - 10https://gerrit.wikimedia.org/r/1204616 [16:47:38] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy2002.wikimedia.org [16:48:16] (03CR) 10Kamila Součková: "Sorry, our comments crossed '^^ Anyway:" [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [16:49:39] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:49:40] (03CR) 10FNegri: maintain-dbusers: add stat for last run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [16:50:22] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1196949 (https://phabricator.wikimedia.org/T407123) (owner: 10Btullis) [16:50:34] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy1002.wikimedia.org - sukhe@cumin1003" [16:50:42] seems to work afaict [16:50:47] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [16:50:53] (03CR) 10Jcrespo: [C:03+1] wmf_root_client.pp: Remove cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/1204628 (https://phabricator.wikimedia.org/T389380) (owner: 10Marostegui) [16:50:55] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy1002.wikimedia.org - sukhe@cumin1003" [16:50:55] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:55] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy1002.wikimedia.org on all recursors [16:50:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy1002.wikimedia.org on all recursors [16:51:02] (03CR) 10FNegri: maintain_dbusers: add basic alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [16:51:25] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:51:31] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy1002.wikimedia.org - sukhe@cumin1003" [16:51:47] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy1002.wikimedia.org - sukhe@cumin1003" [16:51:49] (03CR) 10Dreamy Jazz: "Thanks, that is a fair bit confusing that 
there are two files but having these comments will definitely help :D" [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [16:51:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy1002.wikimedia.org with OS trixie [16:52:06] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367610 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [16:52:09] (03CR) 10Dreamy Jazz: [C:03+1] mw-cron: enable email_verification_reminder [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [16:52:22] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy1001.wikimedia.org with reason: host reimage [16:54:05] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dse-k8s-worker1003.eqiad.wmnet with reason: C/D Migration [16:54:45] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy2002.wikimedia.org - sukhe@cumin1003" [16:54:49] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy2002.wikimedia.org - sukhe@cumin1003" [16:54:49] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:54:49] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy2002.wikimedia.org on all recursors [16:54:53] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy2002.wikimedia.org on all recursors [16:55:24] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy2002.wikimedia.org - sukhe@cumin1003" [16:55:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy2002.wikimedia.org - sukhe@cumin1003" [16:55:57] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy2002.wikimedia.org with OS trixie [16:56:04] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367653 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [16:56:27] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11367655 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:57:03] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202164|Enable the MEX / wbui2025 beta feature on testwikidata (v2) (T407737)]] (duration: 12m 40s) [16:57:07] T407737: [MEX] Add mobile editing for statments on Test Wikidata - https://phabricator.wikimedia.org/T407737 [16:57:21] * Lucas_WMDE done deploying [16:57:39] (03CR) 10Clément Goubert: [C:03+1] mw-cron: enable email_verification_reminder [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [16:57:53] (03CR) 10Bking: [C:03+2] Configure reprepro to mirror upstream opensearch2 and opensearch3 repos [puppet] - 10https://gerrit.wikimedia.org/r/1196949 (https://phabricator.wikimedia.org/T407123) (owner: 10Btullis) [16:58:11] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy1001.wikimedia.org with reason: host reimage [17:00:26] (03CR) 10David Caro: maintain-dbusers: add stat for last run (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [17:00:51] (03CR) 10David Caro: maintain_dbusers: add basic alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [17:02:12] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1024.eqiad.wmnet with reason: C/D Migration [17:02:37] (03PS4) 10David Caro: maintain_dbusers: add basic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) [17:02:56] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy1002.wikimedia.org with reason: host reimage [17:03:02] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy2001.wikimedia.org with reason: host reimage [17:04:13] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ganeti1033.eqiad.wmnet with reason: C/D Migration [17:07:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11367719 (10RobH) >>! In T405945#11357966, @MoritzMuehlenhoff wrote: > @RobH ganeti1024 and ganeti1033 are drained and can be migrated. Migration... [17:08:26] (03CR) 10Andrew Bogott: [C:03+1] "As discussed on IRC, it seems polite to warn users of the network flap before we apply this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1204623 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [17:08:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy1002.wikimedia.org with reason: host reimage [17:09:04] (03PS1) 10Bking: Revert "Configure reprepro to mirror upstream opensearch2 and opensearch3 repos" [puppet] - 10https://gerrit.wikimedia.org/r/1204633 [17:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:13] (03CR) 10Bking: [V:03+2 C:03+2] Revert "Configure reprepro to mirror upstream opensearch2 and opensearch3 repos" [puppet] - 10https://gerrit.wikimedia.org/r/1204633 (owner: 10Bking) [17:10:06] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1046.eqiad.wmnet with reason: C/D Migration [17:10:14] (03CR) 10Brouberol: [C:03+1] deployment_server: migrate mediawiki-dumps-legacy to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1203578 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:10:49] !log eqiad c/d migration work in D6 [17:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:10] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy2001.wikimedia.org with reason: host reimage [17:11:53] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy1001.wikimedia.org with OS trixie [17:11:53] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy1001.wikimedia.org [17:12:02] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha 
proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [17:12:56] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1233.eqiad.wmnet with reason: C/D Migration [17:13:16] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy2002.wikimedia.org with reason: host reimage [17:13:22] (03CR) 10Kamila Součková: [C:03+2] "I'm afraid I'd completely forgotten about the two files, so yeah, agreed :D I now recall that there was a good reason for it, but I'm not " [puppet] - 10https://gerrit.wikimedia.org/r/1204616 (owner: 10Kamila Součková) [17:15:51] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1053.eqiad.wmnet with reason: C/D Migration [17:16:20] (03CR) 10FNegri: [C:03+1] maintain-dbusers: add stat for last run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [17:17:21] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1259.eqiad.wmnet with reason: C/D Migration [17:17:36] (03CR) 10FNegri: [C:03+1] maintain-dbusers: add stat for last run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1204381 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [17:18:00] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy2002.wikimedia.org with reason: host reimage [17:18:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase1045.eqiad.wmnet with reason: C/D Migration [17:19:37] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1258.eqiad.wmnet with reason: C/D Migration [17:19:58] (03CR) 10FNegri: [C:03+1] maintain_dbusers: add basic alerts [alerts] - 
10https://gerrit.wikimedia.org/r/1204575 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [17:20:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aqs1022.eqiad.wmnet with reason: C/D Migration [17:21:21] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4001.wikimedia.org [17:21:23] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:21:36] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org [17:21:37] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org [17:22:06] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy1002.wikimedia.org with OS trixie [17:22:06] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy1002.wikimedia.org [17:22:13] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... 
[17:22:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aqs1022.eqiad.wmnet with reason: C/D Migration [17:24:01] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1174.eqiad.wmnet with reason: C/D Migration [17:25:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dbproxy1025.eqiad.wmnet with reason: C/D Migration [17:25:50] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [17:26:29] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [17:26:39] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1224.eqiad.wmnet with reason: C/D Migration [17:26:43] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Disable limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204619 (owner: 10Clément Goubert) [17:27:30] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4001.wikimedia.org - sukhe@cumin1003" [17:27:53] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1225.eqiad.wmnet with reason: C/D Migration [17:29:02] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy2001.wikimedia.org with OS trixie [17:29:02] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy2001.wikimedia.org [17:29:08] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... 
[17:29:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4001.wikimedia.org - sukhe@cumin1003" [17:29:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:26] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy4001.wikimedia.org on all recursors [17:29:29] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy4001.wikimedia.org on all recursors [17:29:48] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on sessionstore1006.eqiad.wmnet with reason: C/D Migration [17:30:18] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5001.wikimedia.org [17:30:20] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:34:49] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy2002.wikimedia.org with OS trixie [17:34:49] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy2002.wikimedia.org [17:34:55] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [17:35:59] sukhe@cumin1003 makevm (PID 2640311) is awaiting input [17:38:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11367855 (10RobH) @btullis, I wanted to move dse-k8s-worker1010.eqiad.wmnet today as part of the migration, but the detai... 
[17:39:42] !log eqiad c/d migration d6 rack complete for today, onto d3 [17:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:15] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1247.eqiad.wmnet with reason: C/D Migration [17:41:05] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5001.wikimedia.org - sukhe@cumin1003" [17:41:10] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5001.wikimedia.org - sukhe@cumin1003" [17:41:10] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:10] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5001.wikimedia.org on all recursors [17:41:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5001.wikimedia.org on all recursors [17:41:50] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5001.wikimedia.org - sukhe@cumin1003" [17:41:54] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5001.wikimedia.org - sukhe@cumin1003" [17:43:23] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1052.eqiad.wmnet with reason: C/D Migration [17:44:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1175.eqiad.wmnet with reason: C/D Migration [17:44:55] sukhe@cumin1003 makevm (PID 2640311) is awaiting input [17:45:00] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet 
will expire on Wed 10 Dec 2025 05:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [17:46:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1232.eqiad.wmnet with reason: C/D Migration [17:47:31] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4001.wikimedia.org - sukhe@cumin1003" [17:47:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4001.wikimedia.org - sukhe@cumin1003" [17:48:04] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1173.eqiad.wmnet with reason: C/D Migration [17:49:17] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-druid1005.eqiad.wmnet with reason: C/D Migration [17:49:45] sukhe@cumin1003 makevm (PID 2640311) is awaiting input [17:50:00] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4001.wikimedia.org with OS trixie [17:50:10] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367901 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... 
[17:50:45] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on rdb1014.eqiad.wmnet with reason: C/D Migration [17:51:35] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5001.wikimedia.org with OS trixie [17:51:42] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11367922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [17:51:44] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1223.eqiad.wmnet with reason: C/D Migration [17:52:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on pki-root1001.eqiad.wmnet with reason: C/D Migration [17:54:29] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kubestage1004.eqiad.wmnet with reason: C/D Migration [18:00:05] swfrench-wmf: That opportune time for a MediaWiki infrastructure (UTC late) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1800). [18:00:12] o/ [18:00:19] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1204620 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [18:04:39] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-presto1019.eqiad.wmnet with reason: C/D Migration [18:08:31] (03CR) 10Scott French: [C:03+2] deployment_server: fully migrate mw-(api-ext|web) to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1203559 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:10:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11367977 (10RobH) a:05cmooney→03MoritzMuehlenhoff >>! In T405945#11357966, @MoritzMuehlenhoff wrote: > @RobH ganeti1024 and ganeti1033 are dra... [18:11:44] (03CR) 10Dzahn: [C:03+1] "yep, same key that has been removed before by request of the user" [puppet] - 10https://gerrit.wikimedia.org/r/1204586 (owner: 10Awight) [18:11:51] (03CR) 10Dzahn: [C:03+2] Revert "Temporarily revoke ssh access for awight" [puppet] - 10https://gerrit.wikimedia.org/r/1204586 (owner: 10Awight) [18:12:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cp1112.eqiad.wmnet with reason: C/D Migration [18:13:21] (03PS1) 10Bking: Configure reprepro to mirror upstream opensearch2 and opensearch3 repos [puppet] - 10https://gerrit.wikimedia.org/r/1204639 (https://phabricator.wikimedia.org/T407123) [18:14:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1204639 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking) [18:14:47] (03CR) 10Cory Massaro: "This seems very reasonable to me, but I want to see what the default of 3s looks like for a bit. 
It's possible that I've made some bad ass" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) (owner: 10Jforrester) [18:15:06] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1136.eqiad.wmnet with reason: C/D Migration [18:16:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on backup1014.eqiad.wmnet with reason: C/D Migration [18:16:43] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11368002 (10dr0ptp4kt) Following up on Meet with @elukey today, here are the suggested alerting targets for the SLOs: Email: `data-engineering-alerts@ !log swfrench@deploy2002 Started scap sync-world: Fully migrate mw-(api-ext|web) to 8.3 - T405955 [18:19:05] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:22:03] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:22:27] !log swfrench@deploy2002 Finished scap sync-world: Fully migrate mw-(api-ext|web) to 8.3 - T405955 (duration: 03m 51s) [18:22:32] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc-gp1006.eqiad.wmnet with reason: C/D Migration [18:24:25] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cp1115.eqiad.wmnet with reason: C/D Migration [18:24:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:20] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host franio1004 [18:25:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host franio1004 [18:25:56] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1248.eqiad.wmnet with reason: C/D Migration [18:27:22] PROBLEM - Check unit 
status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:27:54] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on logstash1035.eqiad.wmnet with reason: C/D Migration [18:28:04] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): return capacity from migration to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203571 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:28:22] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:36] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: host reimage [18:30:04] (03Merged) 10jenkins-bot: mw-(api-ext|web): return capacity from migration to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203571 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:30:21] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1249.eqiad.wmnet with reason: C/D Migration [18:32:17] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ml-serve1004.eqiad.wmnet with reason: C/D Migration [18:32:33] FYI, I'm going to be applying some capacity changes between mediawiki deployments, during which time scap deployments should not happen. 
[18:32:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: host reimage [18:32:55] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy6001.wikimedia.org [18:32:56] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [18:33:06] while I don't expect that to happen during this window, I'll still take the lock out of an abundance of caution [18:33:23] !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during mw-(api-ext|web) capacity changes - T405955 [18:33:26] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:34:03] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc-wf1002.eqiad.wmnet with reason: C/D Migration [18:36:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on logstash1034.eqiad.wmnet with reason: C/D Migration [18:36:37] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:36:44] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy6001.wikimedia.org - sukhe@cumin1003" [18:36:57] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:38:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:38:28] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:39:01] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:39:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy6001.wikimedia.org - sukhe@cumin1003" [18:39:05] !log sukhe@cumin1003 
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:05] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy6001.wikimedia.org on all recursors [18:39:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy6001.wikimedia.org on all recursors [18:39:19] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:39:28] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:39:31] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy6001.wikimedia.org - sukhe@cumin1003" [18:39:36] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy6001.wikimedia.org - sukhe@cumin1003" [18:39:38] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:39:49] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy6001.wikimedia.org with OS trixie [18:39:56] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... 
[18:40:40] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:40:57] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:41:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:41:20] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:41:24] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on logging-hd1003.eqiad.wmnet with reason: C/D Migration [18:42:39] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:42:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:42:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:43:07] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:45:00] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5001.wikimedia.org with reason: host reimage [18:45:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:45:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:45:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:45:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:46:09] (03CR) 10Cathal Mooney: [C:03+2] Eqiad row c: move vlan gateways to ports facing the Nokia spines [homer/public] - 10https://gerrit.wikimedia.org/r/1202729 (https://phabricator.wikimedia.org/T405579) (owner: 10Cathal Mooney) [18:46:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aqs1015.eqiad.wmnet with reason: C/D Migration [18:47:21] (03Merged) 10jenkins-bot: Eqiad row c: move vlan gateways to ports facing the Nokia spines [homer/public] - 
10https://gerrit.wikimedia.org/r/1202729 (https://phabricator.wikimedia.org/T405579) (owner: 10Cathal Mooney) [18:47:28] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a4 [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) [18:48:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase1042.eqiad.wmnet with reason: C/D Migration [18:48:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4001.wikimedia.org with OS trixie [18:48:20] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy4001.wikimedia.org [18:48:33] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:48:33] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [18:48:43] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5001.wikimedia.org with reason: host reimage [18:48:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:48:54] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org [18:48:54] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org [18:49:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:49:14] (03PS2) 10C. 
Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a4 [vendor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204645 (https://phabricator.wikimedia.org/T409910) [18:49:15] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.23.0-a4 [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) (owner: 10C. Scott Ananian) [18:49:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:49:54] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org [18:49:59] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org [18:50:05] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:50:25] (03CR) 10C. Scott Ananian: [C:03+1] Bump wikimedia/parsoid to 0.23.0-a4 [vendor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204645 (https://phabricator.wikimedia.org/T409910) (owner: 10C. 
Scott Ananian) [18:50:29] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:50:41] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:50:54] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:51:59] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:52:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:52:24] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:52:39] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:53:06] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:53:23] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:53:28] (03PS1) 10Dwisehaupt: Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) [18:53:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:53:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:55:57] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:56:12] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:56:20] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:56:32] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:56:43] !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during mw-(api-ext|web) capacity changes - T405955 
(duration: 23m 20s) [18:56:47] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:58:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11368174 (10RobH) Day 3 (Monday) : No Migrations, catch up day for other tasks for both Rob and John. Tuesday Holiday doesn't count Day 4 Update (Wednesday):... [18:58:50] alright, my changes are complete and the lock is released [19:00:05] andre and jeena: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T1900). [19:00:14] o/ Here we go again [19:00:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11368177 (10RobH) [19:00:36] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11368178 (10RobH) [19:00:51] I'm going to try again to deploy wmf.2 to group1 [19:01:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11368182 (10RobH) [19:01:26] !log eqiad c/d migrations complete for today [19:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:53] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204650 (https://phabricator.wikimedia.org/T408272) [19:01:55] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204650 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [19:02:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11368192 
(10ssingh) >>! In T405623#11298598, @RobH wrote: > Please note this migration has shifted from Oct 15th start date to Nov 1 start date. Hi @RobH. Asking for planning: when is... [19:02:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [19:03:01] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) (owner: 10C. Scott Ananian) [19:03:28] (03CR) 10C. Scott Ananian: [C:03+1] Bump wikimedia/parsoid to 0.23.0-a4 [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) (owner: 10C. Scott Ananian) [19:03:54] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204650 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [19:03:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11368198 (10RobH) All cp hosts in rows C/D have been migrated as of today (last ones done) and all that is left in #traffic realm for migration is dns1006 and lvs1020 via T405602. We... [19:04:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11368200 (10RobH) For dns1006, since the downtime of the host is around 5-12 seconds (missing about 5-12 seq numbers via ping) I'm not sure it even has to be fully depooled as long as... 
[19:08:46] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy6001.wikimedia.org with reason: host reimage [19:09:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5001.wikimedia.org with OS trixie [19:09:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy5001.wikimedia.org [19:09:11] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [19:10:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11368208 (10ssingh) >>! In T405623#11368198, @RobH wrote: > All cp hosts in rows C/D have been migrated as of today (last ones done) and all that is left in #traffic realm for migratio... [19:10:33] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.2 refs T408272 [19:10:38] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [19:11:25] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy5002.wikimedia.org [19:11:27] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [19:12:57] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy6001.wikimedia.org with reason: host reimage [19:14:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) (owner: 10C. 
Scott Ananian) [19:15:07] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5002.wikimedia.org - sukhe@cumin1003" [19:16:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy5002.wikimedia.org - sukhe@cumin1003" [19:16:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:08] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy5002.wikimedia.org on all recursors [19:16:12] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy5002.wikimedia.org on all recursors [19:16:38] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5002.wikimedia.org - sukhe@cumin1003" [19:16:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy5002.wikimedia.org - sukhe@cumin1003" [19:18:11] (03PS4) 10Ladsgroup: mysql: Rename cookbook to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 [19:19:14] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5002.wikimedia.org with OS trixie [19:19:25] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368237 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... 
[19:26:04] (03CR) 10CI reject: [V:04-1] mysql: Rename cookbook to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [19:27:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:28:22] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:30:17] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy6001.wikimedia.org with OS trixie [19:30:17] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy6001.wikimedia.org [19:30:28] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [19:30:32] andre, jeena : i have a backport patch for T409910, which is listed as a train blocker. do we want to spiderpig that now, or wait until the backport window in 90 min? 
[19:30:33] T409910: PHP Warning: foreach() argument must be of type array|object, null given / PHP Warning: Undefined array key "extensionData" - https://phabricator.wikimedia.org/T409910 [19:31:11] cscott: Train looks stable so in my opinion you could go ahead [19:31:22] just what I was typing out :P [19:32:10] ok, spiderpig powers activate [19:32:20] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy6002.wikimedia.org [19:32:21] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [19:34:07] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [vendor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204645 (https://phabricator.wikimedia.org/T409910) (owner: 10C. Scott Ananian) [19:34:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) (owner: 10C. 
Scott Ananian) [19:37:32] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy6002.wikimedia.org - sukhe@cumin1003" [19:38:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy6002.wikimedia.org - sukhe@cumin1003" [19:38:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:38:13] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy6002.wikimedia.org on all recursors [19:38:17] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy6002.wikimedia.org on all recursors [19:38:41] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy6002.wikimedia.org - sukhe@cumin1003" [19:38:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy6002.wikimedia.org - sukhe@cumin1003" [19:38:56] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy6002.wikimedia.org with OS trixie [19:39:07] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368278 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10... [19:49:18] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a4 [vendor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204645 (https://phabricator.wikimedia.org/T409910) (owner: 10C. 
Scott Ananian) [19:49:22] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a4 [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204646 (https://phabricator.wikimedia.org/T409607) (owner: 10C. Scott Ananian) [19:49:55] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1204645|Bump wikimedia/parsoid to 0.23.0-a4 (T409910 T409607)]], [[gerrit:1204646|Bump wikimedia/parsoid to 0.23.0-a4 (T409607)]] [19:50:00] T409910: PHP Warning: foreach() argument must be of type array|object, null given / PHP Warning: Undefined array key "extensionData" - https://phabricator.wikimedia.org/T409910 [19:50:00] T409607: CTT tasks week of 2025-11-07 - https://phabricator.wikimedia.org/T409607 [19:53:45] !log cscott@deploy2002 cscott: Backport for [[gerrit:1204645|Bump wikimedia/parsoid to 0.23.0-a4 (T409910 T409607)]], [[gerrit:1204646|Bump wikimedia/parsoid to 0.23.0-a4 (T409607)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:53:57] ok, testing [19:55:50] verified, continuing [19:55:54] !log cscott@deploy2002 cscott: Continuing with sync [19:56:28] (03PS3) 10Scott French: trafficserver: disable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1203569 (https://phabricator.wikimedia.org/T405955) [19:56:30] (03PS3) 10Scott French: mw-(api-ext|web): return main to nominal multi-DC size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203572 (https://phabricator.wikimedia.org/T405955) [19:56:31] (03PS3) 10Scott French: rest-gateway: Stop diverting PHP_ENGINE=8.3 to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) [19:56:32] (03PS3) 10Scott French: mw-(api-ext|web): return next to "idle" size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203574 (https://phabricator.wikimedia.org/T405955) [20:01:27] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T407510#11368375 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This was from a port that was scheduled to be taken down. [20:01:30] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204645|Bump wikimedia/parsoid to 0.23.0-a4 (T409910 T409607)]], [[gerrit:1204646|Bump wikimedia/parsoid to 0.23.0-a4 (T409607)]] (duration: 11m 35s) [20:01:36] T409910: PHP Warning: foreach() argument must be of type array|object, null given / PHP Warning: Undefined array key "extensionData" - https://phabricator.wikimedia.org/T409910 [20:01:36] T409607: CTT tasks week of 2025-11-07 - https://phabricator.wikimedia.org/T409607 [20:01:41] ok, all done. 
[20:02:10] (03PS1) 10Kosta Harlan: EventLogging: Expand on no-js logging [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204662 (https://phabricator.wikimedia.org/T409779) [20:02:56] cscott: I'll do a backport then [20:03:07] jeena / andre is that ok? [20:03:23] Yes yes, the train is fine and done already [20:03:34] (03PS1) 10Kosta Harlan: EventLogging: Expand on no-js logging [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1204663 (https://phabricator.wikimedia.org/T409779) [20:03:40] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM! Thank you." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203451 (https://phabricator.wikimedia.org/T408711) (owner: 10Brouberol) [20:03:43] works for me, i'm done [20:04:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1204663 (https://phabricator.wikimedia.org/T409779) (owner: 10Kosta Harlan) [20:04:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204662 (https://phabricator.wikimedia.org/T409779) (owner: 10Kosta Harlan) [20:08:23] FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:35] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy6002.wikimedia.org with reason: host reimage [20:09:30] andrew@cumin2002 reimage (PID 101251) is awaiting input [20:10:21] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5002.wikimedia.org with reason: host reimage [20:12:51] !log sukhe@cumin1003 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy6002.wikimedia.org with reason: host reimage [20:13:23] RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11368426 (10RobH) IRC Update: Chatted with Sukhbir about this just updating the task for reference: Current plan is for John and I to knock out the remainder of all hosts that we can... [20:16:02] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5002.wikimedia.org with reason: host reimage [20:17:04] (03Merged) 10jenkins-bot: EventLogging: Expand on no-js logging [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1204663 (https://phabricator.wikimedia.org/T409779) (owner: 10Kosta Harlan) [20:17:06] (03Merged) 10jenkins-bot: EventLogging: Expand on no-js logging [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204662 (https://phabricator.wikimedia.org/T409779) (owner: 10Kosta Harlan) [20:17:43] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204663|EventLogging: Expand on no-js logging (T409779 T263505)]], [[gerrit:1204662|EventLogging: Expand on no-js logging (T409779 T263505)]] [20:17:48] T409779: [SPIKE] Review hCaptcha measurement plan - https://phabricator.wikimedia.org/T409779 [20:17:49] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org [20:17:49] T263505: Create logging instrumentation for Wikitext editor not affected by ad blockers - https://phabricator.wikimedia.org/T263505 [20:17:49] 
!log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org [20:20:22] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204663|EventLogging: Expand on no-js logging (T409779 T263505)]], [[gerrit:1204662|EventLogging: Expand on no-js logging (T409779 T263505)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:23:38] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11368447 (10Andrew) I just ran a couple more tests: 1) Installed host, paused at the end of install 2) Installed new grub packages (grub-common, grub2-common, grub-pc-... [20:23:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11368448 (10RobH) IRC Update from chat with Tobias: ml-cache1002 & ml-serve1004 have both been migrated to the new switch stacks. ml-serve1003 will be drained t... 
[20:23:47] !log kharlan@deploy2002 kharlan: Continuing with sync [20:27:58] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204663|EventLogging: Expand on no-js logging (T409779 T263505)]], [[gerrit:1204662|EventLogging: Expand on no-js logging (T409779 T263505)]] (duration: 10m 15s) [20:28:03] T409779: [SPIKE] Review hCaptcha measurement plan - https://phabricator.wikimedia.org/T409779 [20:28:03] T263505: Create logging instrumentation for Wikitext editor not affected by ad blockers - https://phabricator.wikimedia.org/T263505 [20:30:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy6002.wikimedia.org with OS trixie [20:30:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy6002.wikimedia.org [20:30:41] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368465 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [20:30:48] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org [20:30:49] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org [20:30:57] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204666 [20:31:38] (03CR) 10Scott French: "+cc @cgoubert@wikimedia.org - FYI, in case we need to coordinate changes to rest-gateway." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1203573 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:36:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5002.wikimedia.org with OS trixie [20:36:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy5002.wikimedia.org [20:36:48] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11368489 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f... [20:36:58] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T409967 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [20:37:04] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409967 (10ops-monitoring-bot) 03NEW [20:41:33] (03PS1) 10Jforrester: wikifunctions: Drop old TODO, now TODON'T as we Declined the task [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204668 [20:44:07] (03CR) 10Jforrester: [C:03+2] wikifunctions: Drop old TODO, now TODON'T as we Declined the task [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204668 (owner: 10Jforrester) [20:46:12] (03Merged) 10jenkins-bot: wikifunctions: Drop old TODO, now TODON'T as we Declined the task [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204668 (owner: 10Jforrester) [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. 
Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T2100). [21:00:04] cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:49] my patches are done already [21:01:56] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11368561 (10JVanderhoop-WMF) 05Open→03Resolved [21:02:25] I'll likely have one more patch to backport soon [21:02:26] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:03:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:03:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.200 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [21:08:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire 
[21:10:21] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[21:24:15] (PS4) Eevans: cassandra: create revise_tone_task_generator Cassandra role [puppet] - https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850)
[21:26:23] (PS1) DLynch: EventLogging: Fix wikitext editor interface detection [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204678 (https://phabricator.wikimedia.org/T409779)
[21:26:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204678 (https://phabricator.wikimedia.org/T409779) (owner: DLynch)
[21:27:23] I've got that patch kostajh mentioned earlier.
[21:28:12] (PS1) DLynch: EventLogging: Fix wikitext editor interface detection [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1204679 (https://phabricator.wikimedia.org/T409779)
[21:28:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1204679 (https://phabricator.wikimedia.org/T409779) (owner: DLynch)
[21:29:06] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204678 (https://phabricator.wikimedia.org/T409779) (owner: DLynch)
[21:29:06] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1204679 (https://phabricator.wikimedia.org/T409779) (owner: DLynch)
[21:35:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:40:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:40:33] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11368668 (Andrew) a:Andrew→None
[21:42:49] (Merged) jenkins-bot: EventLogging: Fix wikitext editor interface detection [extensions/WikiEditor] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1204678 (https://phabricator.wikimedia.org/T409779) (owner: DLynch)
[21:42:50] (Merged) jenkins-bot: EventLogging: Fix wikitext editor interface detection [extensions/WikiEditor] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1204679 (https://phabricator.wikimedia.org/T409779) (owner: DLynch)
[21:43:26] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1204678|EventLogging: Fix wikitext editor interface detection (T409779)]], [[gerrit:1204679|EventLogging: Fix wikitext editor interface detection (T409779)]]
[21:43:30] T409779: [SPIKE] Review hCaptcha measurement plan - https://phabricator.wikimedia.org/T409779
[21:44:52] (PS1) Eevans: Add (fake) revise_tone_task_generator password [labs/private] - https://gerrit.wikimedia.org/r/1204682
[21:45:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:46:07] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1204678|EventLogging: Fix wikitext editor interface detection (T409779)]], [[gerrit:1204679|EventLogging: Fix wikitext editor interface detection (T409779)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:46:41] (CR) Eevans: [V:+2 C:+2] Add (fake) revise_tone_task_generator password [labs/private] - https://gerrit.wikimedia.org/r/1204682 (owner: Eevans)
[21:47:03] !log kemayo@deploy2002 kemayo: Continuing with sync
[21:47:29] (CR) Eevans: [C:+2] cassandra: create revise_tone_task_generator Cassandra role [puppet] - https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) (owner: Eevans)
[21:47:30] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:48:30] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 8.920 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:50:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:51:14] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204678|EventLogging: Fix wikitext editor interface detection (T409779)]], [[gerrit:1204679|EventLogging: Fix wikitext editor interface detection (T409779)]] (duration: 07m 48s)
[21:51:18] T409779: [SPIKE] Review hCaptcha measurement plan - https://phabricator.wikimedia.org/T409779
[21:51:30] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:52:22] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 2.352 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[21:55:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:37] (CR) Dzahn: "If we add a dependency on a puppetdb it means we can't have a test setup in cloud unless we build and maintain our own local puppetdb in t" [puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T2200)
[22:00:17] (CR) Dzahn: "Ignore my previous comment if this is only in prometheus::ops code and that isn't applied on the actual gerrit server." [puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[22:04:14] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:04:14] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:04:20] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:05:14] RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:05:14] RECOVERY - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:05:20] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:09:34] (PS1) Dzahn: fail-over releases.wikimedia.org backend [dns] - https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127)
[22:15:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:17:22] (PS1) Eevans: data-gateway: deploy v1.0.13 to production [deployment-charts] - https://gerrit.wikimedia.org/r/1204687 (https://phabricator.wikimedia.org/T401021)
[22:20:13] (CR) Eevans: [C:+2] data-gateway: deploy v1.0.13 to production [deployment-charts] - https://gerrit.wikimedia.org/r/1204687 (https://phabricator.wikimedia.org/T401021) (owner: Eevans)
[22:22:08] (Merged) jenkins-bot: data-gateway: deploy v1.0.13 to production [deployment-charts] - https://gerrit.wikimedia.org/r/1204687 (https://phabricator.wikimedia.org/T401021) (owner: Eevans)
[22:24:42] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[22:24:57] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[22:25:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:25:16] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[22:25:36] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[22:30:02] RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:30:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[22:36:57] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T409980 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[22:37:07] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409980 (ops-monitoring-bot) NEW
[22:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251112T2300)
[23:01:41] (PS2) Aaron Schulz: Sandbox cleanup for the Wikimedia REST APIs [mediawiki-config] - https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776)
[23:03:58] (CR) Aaron Schulz: Sandbox cleanup for the Wikimedia REST APIs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: Aaron Schulz)
[23:29:07] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:07] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:48:31] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[23:50:52] (PS1) Bvibber: Reduce number of bucketsizes for MediaViewer (group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/1204700 (https://phabricator.wikimedia.org/T372165)
[23:56:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.241 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static