[00:02:12] <wikibugs>	 (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1207293
[00:02:15] <wikibugs>	 (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294
[00:02:19] <wikibugs>	 (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207295
[00:08:23] <wikibugs>	 (PS1) Dzahn: switch historic Subversion URLs from Phabricator to static-codereview [puppet] - https://gerrit.wikimedia.org/r/1207296
[00:09:48] <wikibugs>	 (PS2) Dzahn: switch historic Subversion URLs from Phabricator to static-codereview [puppet] - https://gerrit.wikimedia.org/r/1207296
[00:11:01] <wikibugs>	 (CR) Dzahn: [C:+1] "yea, whois has our name servers but needs NS in DNS" [dns] - https://gerrit.wikimedia.org/r/1207293 (owner: Ncmonitor)
[00:12:26] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks as always for the docs links!" [deployment-charts] - https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510) (owner: RLazarus)
[00:12:50] <wikibugs>	 (CR) Scott French: [C:+1] kubernetes: Set default Envoy version to 1.32.12 [puppet] - https://gerrit.wikimedia.org/r/1207289 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus)
[00:12:56] <wikibugs>	 (CR) Dzahn: [C:-1] "seems like another candidate for the "dont-pay-for-wikipedia-articles"" [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[00:13:41] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks for cleaning this up!" [deployment-charts] - https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:14:37] <wikibugs>	 (CR) Dzahn: [C:+1] "yea, whois has our NS" [puppet] - https://gerrit.wikimedia.org/r/1207295 (owner: Ncmonitor)
[00:16:38] <wikibugs>	 (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[00:19:47] <wikibugs>	 (PS1) Dzahn: admin: transfer group approver for releasers-mediawiki to Mateus Santos [puppet] - https://gerrit.wikimedia.org/r/1207304
[00:20:33] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:20] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1207305
[00:23:27] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:10] <wikibugs>	 (PS1) Dzahn: admin: deprecate the releasers-blubber group [puppet] - https://gerrit.wikimedia.org/r/1207313
[00:33:45] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1207320
[00:36:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390595 (Scott_French) @RobH - Thanks for checking!  I'll also be out 12-01. I see you mentioned 11-21, but that's Friday. Did you mean Monday 11-24?  If so, that (11-24) sou...
[00:38:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390608 (RobH) I totally messed up the dates on your comment:  2025-11-20, 2025-11-24, 2025-11-25 2025-12-03, 2025-12-04  So yeah, we can plan for the 24th (monday) no problem!
[00:38:37] <wikibugs>	 (CR) Dzahn: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[00:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:06] <wikibugs>	 (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1207293 (owner: Ncmonitor)
[00:40:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1207321
[00:40:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1207321 (owner: TrainBranchBot)
[00:40:24] <logmsgbot>	 !log brett@dns1006 START - running authdns-update
[00:40:44] <wikibugs>	 (CR) RLazarus: [C:+2] kartotherian, tegola-vector-tiles: Remove unused tcp_health_check [deployment-charts] - https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:41:24] <logmsgbot>	 !log brett@dns1006 END - running authdns-update
[00:42:35] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11390612 (Ladsgroup) If you want to, I'll be around Thursday and Friday of this week and I can depool them for you. I can also do the 10G switch too (but...
[00:43:00] <wikibugs>	 (Merged) jenkins-bot: kartotherian, tegola-vector-tiles: Remove unused tcp_health_check [deployment-charts] - https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:46:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390638 (Scott_French) Ack, Monday 2025-11-24 it is. Thank you!
[00:53:50] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1207321 (owner: TrainBranchBot)
[00:59:18] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 3 days, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[00:59:36] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1185* gradually with 4 steps - Work done
[01:00:43] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:03:15] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[01:03:23] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T410589)', diff saved to https://phabricator.wikimedia.org/P85397 and previous config saved to /var/cache/conftool/dbconfig/20251120-010322-ladsgroup.json
[01:03:27] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:10:19] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1207336
[01:10:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1207336 (owner: TrainBranchBot)
[01:14:16] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 32s)
[01:14:35] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:14:49] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:14:49] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:32:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:58] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1207336 (owner: TrainBranchBot)
[01:39:52] <logmsgbot>	 !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[01:40:26] <logmsgbot>	 !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[01:44:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:03] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185* gradually with 4 steps - Work done
[02:42:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11390787 (herron) a:herron→RobH >>! In T405946#11390399, @RobH wrote: > We don't want to move anything the day before a holiday or weekend, as it doesn't allow fo...
[02:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:44:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[03:54:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:08:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:10:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:13:55] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (cloudcontrol2010-dev), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:14:55] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T410589)', diff saved to https://phabricator.wikimedia.org/P85402 and previous config saved to /var/cache/conftool/dbconfig/20251120-051454-ladsgroup.json
[05:14:59] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:15:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:15:35] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:30:03] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P85403 and previous config saved to /var/cache/conftool/dbconfig/20251120-053002-ladsgroup.json
[05:33:24] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:39:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:45:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:10] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P85404 and previous config saved to /var/cache/conftool/dbconfig/20251120-054509-ladsgroup.json
[06:00:18] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T410589)', diff saved to https://phabricator.wikimedia.org/P85405 and previous config saved to /var/cache/conftool/dbconfig/20251120-060017-ladsgroup.json
[06:00:23] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[06:00:34] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[06:00:42] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T410589)', diff saved to https://phabricator.wikimedia.org/P85406 and previous config saved to /var/cache/conftool/dbconfig/20251120-060041-ladsgroup.json
[06:10:03] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:10:41] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11390886 (Marostegui) @RobH as @Ladsgroup mentions, pc* hosts can only be done one at the time. I am out half today and Friday as oncall compensation. If...
[06:12:52] <wikibugs>	 (PS1) Marostegui: db2144: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207492 (https://phabricator.wikimedia.org/T410480)
[06:13:55] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:14:02] <wikibugs>	 (CR) Marostegui: [C:+2] db2144: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207492 (https://phabricator.wikimedia.org/T410480) (owner: Marostegui)
[06:14:52] <wikibugs>	 (CR) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:15:03] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:20:00] <wikibugs>	 (PS1) Marostegui: db1151: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207493
[06:20:30] <wikibugs>	 (CR) Marostegui: [C:+2] db1151: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207493 (owner: Marostegui)
[06:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0700).
[07:14:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[07:19:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[07:21:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms2 T410480', diff saved to https://phabricator.wikimedia.org/P85407 and previous config saved to /var/cache/conftool/dbconfig/20251120-072110-marostegui.json
[07:21:15] <stashbot>	 T410480: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480
[07:26:25] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: move row C hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207739 (https://phabricator.wikimedia.org/T399180)
[07:26:27] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: move row D hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207740 (https://phabricator.wikimedia.org/T399180)
[07:26:29] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: move rack E4 hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207741 (https://phabricator.wikimedia.org/T399180)
[07:26:31] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: move rack F4 hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207742 (https://phabricator.wikimedia.org/T399180)
[07:26:32] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: move codfw hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207743 (https://phabricator.wikimedia.org/T399180)
[07:26:43] <wikibugs>	 (PS1) DCausse: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744
[07:32:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:35:39] <wikibugs>	 (PS1) Filippo Giunchedi: install_server: remove unused raid10-4dev-trixie.cfg and reuse-raid10-8dev.cfg [puppet] - https://gerrit.wikimedia.org/r/1207749
[07:36:07] <wikibugs>	 (CR) Filippo Giunchedi: "Also reuse-raid10-8dev according to the comments is specific to kafka-main" [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[07:44:39] <wikibugs>	 (PS1) Marostegui: check_private_data_report: Add clouddb102[45] [puppet] - https://gerrit.wikimedia.org/r/1207757 (https://phabricator.wikimedia.org/T409557)
[07:45:09] <wikibugs>	 (PS2) DCausse: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744
[07:45:09] <wikibugs>	 (PS1) DCausse: cirrus: enable default_sort for completion on a set of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858)
[07:45:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (owner: DCausse)
[07:45:57] <wikibugs>	 (CR) Marostegui: [C:+2] check_private_data_report: Add clouddb102[45] [puppet] - https://gerrit.wikimedia.org/r/1207757 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[07:49:47] <wikibugs>	 (PS3) DCausse: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (https://phabricator.wikimedia.org/T404858)
[07:49:49] <wikibugs>	 (PS2) DCausse: cirrus: enable default_sort for completion on a set of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858)
[07:53:13] <wikibugs>	 SRE: Improve "reuse" feature for standard partman recipes - https://phabricator.wikimedia.org/T410601 (fgiunchedi) NEW
[07:57:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0800).
[08:00:05] <jouncebot>	 dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:11] <dcausse>	 o/
[08:00:15] <dcausse>	 I can deploy
[08:01:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[08:02:40] <wikibugs>	 (PS3) Arnaudb: gerrit: add dry run rsync [cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833)
[08:02:43] <wikibugs>	 (Merged) jenkins-bot: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[08:04:16] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1207744|Revert "cirrus: start A/B test on completion with default_sort" (T404858)]]
[08:04:20] <stashbot>	 T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:06:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:09:10] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1207744|Revert "cirrus: start A/B test on completion with default_sort" (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:11:31] <kostajh>	 dcausse: I'll do a backport when you're done
[08:11:38] <dcausse>	 ack
[08:12:58] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Continuing with sync
[08:13:02] <moritzm>	 !log installing squid security updates
[08:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:10] <logmsgbot>	 !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207744|Revert "cirrus: start A/B test on completion with default_sort" (T404858)]] (duration: 12m 54s)
[08:17:15] <stashbot>	 T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:17:46] <dcausse>	 kostajh: I'm done
[08:19:48] <kostajh>	 dcausse: thanks
[08:20:25] <kostajh>	 I need ~20 minutes or so before I can start
[08:21:37] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Add CIDRs enabling pod-to-pod communication. [deployment-charts] - https://gerrit.wikimedia.org/r/1207785 (https://phabricator.wikimedia.org/T408538)
[08:21:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:22:29] <wikibugs>	 (PS1) Muehlenhoff: Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1207786
[08:22:52] <wikibugs>	 (PS2) Muehlenhoff: sre.hosts.decommission: Fix typo [cookbooks] - https://gerrit.wikimedia.org/r/1207122
[08:26:53] <wikibugs>	 (CR) Muehlenhoff: "Per git history raid10-4dev-trixie.cfg was only recently added by Andrew for some test, adding him for comments" [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:28:03] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Log the risk score for null edits differently [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550)
[08:28:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550) (owner: Kosta Harlan)
[08:28:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550) (owner: Kosta Harlan)
[08:29:29] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1207304 (owner: Dzahn)
[08:32:37] <wikibugs>	 (CR) Muehlenhoff: admin: deprecate the releasers-blubber group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1207313 (owner: Dzahn)
[08:32:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:35:47] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Log the risk score for null edits differently [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550) (owner: Kosta Harlan)
[08:36:22] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1207788|hCaptcha: Log the risk score for null edits differently (T410550)]]
[08:36:27] <stashbot>	 T410550: hCaptcha: log risk score of null edits with other action than `edit` - https://phabricator.wikimedia.org/T410550
[08:36:29] <wikibugs>	 (CR) Muehlenhoff: garage: Productionize garage (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo)
[08:37:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:38:29] <wikibugs>	 (CR) Filippo Giunchedi: "Indeed, IIRC that was added to debug https://phabricator.wikimedia.org/T407586 which is now fixed" [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:38:51] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, few nits inline" [puppet] - https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo)
[08:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:39:47] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Makes sense. The reuse-raid10-8dev.cfg was only used for the now decommed old nodes and is no longer needed." [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:40:09] <wikibugs>	 (CR) Muehlenhoff: [C:+2] sre.hosts.decommission: Fix typo [cookbooks] - https://gerrit.wikimedia.org/r/1207122 (owner: Muehlenhoff)
[08:40:12] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] install_server: remove unused raid10-4dev-trixie.cfg and reuse-raid10-8dev.cfg [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:40:50] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1207788|hCaptcha: Log the risk score for null edits differently (T410550)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:41:03] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1203871 (owner: Elukey)
[08:43:10] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[08:45:29] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@f3216ec] (releasing): testing deploy to failover host
[08:45:59] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@f3216ec] (releasing): testing deploy to failover host (duration: 00m 30s)
[08:47:13] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207788|hCaptcha: Log the risk score for null edits differently (T410550)]] (duration: 10m 51s)
[08:47:18] <stashbot>	 T410550: hCaptcha: log risk score of null edits with other action than `edit` - https://phabricator.wikimedia.org/T410550
[08:51:41] <wikibugs>	 (PS1) Gehel: Hive: alert when query rate is too high [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528)
[08:53:15] <wikibugs>	 (CR) CI reject: [V:-1] Hive: alert when query rate is too high [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: Gehel)
[08:53:36] <wikibugs>	 (CR) Gehel: "I'm not quite sure why the alert isn't generated in the test..." [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: Gehel)
[09:00:05] <jouncebot>	 brennen and andre: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0900).
[09:02:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] relforge: Clarify comment about cumin masters role [puppet] - https://gerrit.wikimedia.org/r/1207212 (owner: Alexandros Kosiaris)
[09:06:46] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[09:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:11:46] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[09:24:18] <wikibugs>	 (PS1) Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207798
[09:24:57] <wikibugs>	 (PS2) Gehel: Hive: alert when query rate is too high [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528)
[09:25:38] <wikibugs>	 (CR) Gehel: [C:-1] "The tests are passing, but the check is done on absolute values, not irate()." [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: Gehel)
[09:30:03] <wikibugs>	 (PS2) Alexandros Kosiaris: toolhub: make extraFQDNs specific to codfw, eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/954290
[09:33:57] <wikibugs>	 (CR) Bartosz Wójtowicz: ml-services: add new namespace to prod (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1207798 (owner: Dpogorzelski)
[09:38:57] <wikibugs>	 (PS2) Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207798
[09:41:37] <wikibugs>	 (CR) Bartosz Wójtowicz: ml-services: add new namespace to prod (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1207798 (owner: Dpogorzelski)
[09:41:37] <wikibugs>	 (PS3) Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207798
[09:43:21] <wikibugs>	 (PS1) Dpogorzelski: ml-services: add revise-tone-task-generator [puppet] - https://gerrit.wikimedia.org/r/1207800
[09:44:16] <effie>	 that was quick 
[09:45:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:38] <wikibugs>	 (PS1) Ayounsi: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367)
[09:56:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T410589)', diff saved to https://phabricator.wikimedia.org/P85408 and previous config saved to /var/cache/conftool/dbconfig/20251120-095606-ladsgroup.json
[09:56:12] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[09:58:14] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11391188 (ayounsi) One liners: `lang=python >>> spicerack.redfish('sretest2004').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridg...
[09:59:03] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: Ayounsi)
[09:59:50] <wikibugs>	 (PS7) Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578)
[10:00:39] <wikibugs>	 (PS2) Ayounsi: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367)
[10:11:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P85409 and previous config saved to /var/cache/conftool/dbconfig/20251120-101114-ladsgroup.json
[10:24:44] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: assign ratelimit class by network range (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler)
[10:26:23] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P85410 and previous config saved to /var/cache/conftool/dbconfig/20251120-102622-ladsgroup.json
[10:33:25] <wikibugs>	 (PS2) Arnaudb: apt: add an alert on reprepro errors [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835)
[10:33:25] <wikibugs>	 (CR) Arnaudb: "this patch brings new alerts based on metrics introduced in 1205162 and 1206887" [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: Arnaudb)
[10:37:26] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: assign ratelimit class by network range (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler)
[10:37:35] <wikibugs>	 (CR) Clément Goubert: [C:+2] rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[10:39:22] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[10:41:30] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T410589)', diff saved to https://phabricator.wikimedia.org/P85411 and previous config saved to /var/cache/conftool/dbconfig/20251120-104129-ladsgroup.json
[10:41:35] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[10:41:35] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[10:41:43] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T410589)', diff saved to https://phabricator.wikimedia.org/P85412 and previous config saved to /var/cache/conftool/dbconfig/20251120-104142-ladsgroup.json
[10:43:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[10:46:44] <claime>	 hmm
[10:47:05] <claime>	 Oh it's staging
[10:47:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:47:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:49:47] <claime>	 arnaudb, effie, just a heads-up: I'm switching the rest-gateway's backends to dc-local. No impact expected, but it will shift some traffic. I'll keep an eye on graphs.
[10:49:58] <effie>	 cool 
[10:49:58] <arnaudb>	 ack, thanks claime 
[10:50:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:50:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:51:31] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132)
[10:52:20] <wikibugs>	 (PS4) Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132)
[10:53:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[10:59:06] <wikibugs>	 (PS1) DCausse: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602)
[10:59:18] <wikibugs>	 (PS1) DCausse: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1100)
[11:00:09] <wikibugs>	 (PS1) David Caro: maintain_dbusers: parse the response before throwing [puppet] - https://gerrit.wikimedia.org/r/1207814
[11:02:36] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) (owner: DCausse)
[11:02:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: DCausse)
[11:09:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:12:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:30:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:35:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:35:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612 (Blake) NEW
[11:40:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11391425 (Clement_Goubert) p:Triage→Medium @Kappakayala and @hnowlan being OOO, @mark could I get approval for this please?
[11:40:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:40:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[11:44:54] <wikibugs>	 (CR) FNegri: [C:+1] maintain_dbusers: parse the response before throwing [puppet] - https://gerrit.wikimedia.org/r/1207814 (owner: David Caro)
[11:45:23] <wikibugs>	 (CR) FNegri: [C:+1] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - https://gerrit.wikimedia.org/r/1204627 (owner: Majavah)
[11:45:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:45:51] <wikibugs>	 (CR) Majavah: [C:-1] "This will lead to more unclear error messages when the error is not valid JSON, unfortunately." [puppet] - https://gerrit.wikimedia.org/r/1207814 (owner: David Caro)
[11:46:14] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - https://gerrit.wikimedia.org/r/1204627 (owner: Majavah)
[11:48:10] <wikibugs>	 (CR) Clément Goubert: [C:+1] mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: Scott French)
[11:48:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11391457 (Volans)
[11:48:47] <wikibugs>	 (PS1) Blake: Add blake to ops, remove blake from ops-limited. [puppet] - https://gerrit.wikimedia.org/r/1207824 (https://phabricator.wikimedia.org/T410612)
[11:50:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[11:51:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:53:55] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+1] "Looks good to me :) It handles ns creation on both staging and prod, correct?" [puppet] - https://gerrit.wikimedia.org/r/1207800 (owner: Dpogorzelski)
[11:55:33] <wikibugs>	 (PS6) Majavah: interface::route: Support passing in a CIDR directly [puppet] - https://gerrit.wikimedia.org/r/1196368
[11:55:34] <wikibugs>	 (PS7) Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - https://gerrit.wikimedia.org/r/1196367
[11:55:34] <wikibugs>	 (PS1) Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - https://gerrit.wikimedia.org/r/1207826
[11:56:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:56:33] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7656/co" [puppet] - https://gerrit.wikimedia.org/r/1196367 (owner: Majavah)
[11:57:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:58:19] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - https://gerrit.wikimedia.org/r/1207826 (owner: Majavah)
[11:59:09] <wikibugs>	 (CR) Cathal Mooney: "Nice!  LGTM, I'd say let's test it a bit and get Luca's input when he's back before merging but code looks good to me and makes sense." [cookbooks] - https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: Ayounsi)
[12:01:36] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7655/consol" [puppet] - https://gerrit.wikimedia.org/r/1207826 (owner: Majavah)
[12:02:02] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:02:24] <wikibugs>	 (PS2) Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - https://gerrit.wikimedia.org/r/1207826
[12:02:24] <wikibugs>	 (PS8) Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - https://gerrit.wikimedia.org/r/1196367
[12:02:36] <Lucas_WMDE>	 jouncebot: nowandnext
[12:02:36] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[12:02:36] <jouncebot>	 In 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1300)
[12:02:51] <Lucas_WMDE>	 I’d like to roll out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 if nobody objects (config cleanup, mostly a no-op)
[12:03:56] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7657/co" [puppet] - https://gerrit.wikimedia.org/r/1196367 (owner: Majavah)
[12:04:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: Lucas Werkmeister (WMDE))
[12:04:35] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - https://gerrit.wikimedia.org/r/1207826 (owner: Majavah)
[12:04:53] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::cloudgw: Use interface::route wrapper [puppet] - https://gerrit.wikimedia.org/r/1196367 (owner: Majavah)
[12:05:18] <wikibugs>	 (PS3) Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - https://gerrit.wikimedia.org/r/1207826
[12:05:18] <wikibugs>	 (PS9) Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - https://gerrit.wikimedia.org/r/1196367
[12:05:20] <wikibugs>	 (Merged) jenkins-bot: tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: Lucas Werkmeister (WMDE))
[12:05:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1207171|tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (T410507)]]
[12:05:59] <stashbot>	 T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507
[12:06:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:07:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:07:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[12:08:01] <wikibugs>	 (PS2) Clément Goubert: trafficserver: action api to rest-gateway enwiki 50% [puppet] - https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223)
[12:09:42] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7658/consol" [puppet] - https://gerrit.wikimedia.org/r/1207826 (owner: Majavah)
[12:10:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1207171|tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (T410507)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:10:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[12:12:02] <jinxer-wm>	 FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:13:44] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[12:13:49] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[12:13:54] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Enable Changeprop for revise-tone-task-generator staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538)
[12:14:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207171|tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (T410507)]] (duration: 08m 57s)
[12:14:55] <stashbot>	 T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507
[12:16:20] * Lucas_WMDE done deploying
[12:16:26] <anzx>	 Lucas_WMDE: thanks again for the deployment. I guess the resetauthentication script is not needed since there's no IP address
[12:16:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[12:16:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[12:16:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:17:54] <Lucas_WMDE>	 anzx: yeah, it would give the same error as yesterday ^^
[12:18:01] <Lucas_WMDE>	 I also updated the wikitech docs btw
[12:19:14] <anzx>	 thank you 
[12:21:08] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync
[12:21:37] <claime>	 !log roll-restart of mobileapps codfw - T410296
[12:21:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:41] <stashbot>	 T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 - https://phabricator.wikimedia.org/T410296
[12:22:02] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:22:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync
[12:26:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[12:26:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[12:26:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:27:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:27:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[12:28:07] <wikibugs>	 (PS1) JMeybohm: Implement fetching of ipblock-source urls [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1207834
[12:28:58] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Implement fetching of ipblock-source urls [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1207834 (owner: JMeybohm)
[12:31:02] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin2002 - T402014"
[12:31:04] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin2002 - T402014
[12:31:06] <stashbot>	 T402014: Add ipblock-source objects and logic - https://phabricator.wikimedia.org/T402014
[12:31:46] <wikibugs>	 (CR) Clément Goubert: [C:+2] trafficserver: action api to rest-gateway enwiki 50% [puppet] - https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[12:31:55] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin2002 - T402014
[12:31:57] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin2002 - T402014"
[12:32:43] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11391704 (cmooney) >>! In T408892#11389081, @Papaul wrote: > I think I am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that...
[12:32:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[12:32:45] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+1] [Growth] Enable Add Link task pool generation for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm)
[12:37:58] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[12:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:39:35] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: SystemdUnitCrashLoop (instance grafana2001:9100) - https://phabricator.wikimedia.org/T410619 (10LSobanski) 03NEW
[12:41:26] <wikibugs>	 (03PS1) 10JMeybohm: P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1207844 (https://phabricator.wikimedia.org/T402014)
[12:42:58] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[12:44:34] <wikibugs>	 (03CR) 10LWatson: [C:03+1] Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[12:45:38] <wikibugs>	 (03PS1) 10STran: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512)
[12:51:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11391814 (10cmooney)
[12:52:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:53:24] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:54:41] <wikibugs>	 (03PS1) 10JMeybohm: fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014)
[12:54:50] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler)
[12:56:14] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), and 4 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11391847 (10Clement_Goubert)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1300)
[13:01:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:03:30] <wikibugs>	 (03CR) 10Klausman: ml-services: add new namespace to prod (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski)
[13:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:21:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, only docs to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah)
[13:22:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, modulo I5891d5367 of course" [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah)
[13:23:25] <wikibugs>	 (03PS7) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368
[13:23:25] <wikibugs>	 (03PS4) 10Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826
[13:23:25] <wikibugs>	 (03PS10) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367
[13:23:46] <wikibugs>	 (03PS4) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798
[13:25:07] <wikibugs>	 (03CR) 10Majavah: interface::route: Support passing in a CIDR directly (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah)
[13:25:07] <wikibugs>	 (03PS5) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798
[13:25:13] <wikibugs>	 (03CR) 10Majavah: [C:03+2] interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah)
[13:26:24] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah)
[13:27:11] <wikibugs>	 (03PS6) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798
[13:28:23] <wikibugs>	 (03PS3) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020)
[13:28:59] <wikibugs>	 (03CR) 10Dpogorzelski: ml-services: add new namespace to prod (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski)
[13:29:55] <wikibugs>	 (03CR) 10Jcrespo: garage: Productionize garage (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[13:30:21] <wikibugs>	 (03CR) 10Jcrespo: garage: Productionize garage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[13:30:31] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French)
[13:31:49] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: add revise-tone-task-generator [puppet] - 10https://gerrit.wikimedia.org/r/1207800 (owner: 10Dpogorzelski)
[13:32:12] <Amir1>	 jouncebot: nowandnext
[13:32:12] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1300)
[13:32:12] <jouncebot>	 In 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400)
[13:32:14] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French)
[13:33:16] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:33:33] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:33:56] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:34:43] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:34:52] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[13:35:34] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[13:42:41] <wikibugs>	 (03PS1) 10Marostegui: data.yaml: Add FIDO key for marostegui [puppet] - 10https://gerrit.wikimedia.org/r/1207863
[13:45:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:39] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Offboarding roti [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628)
[13:47:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11392028 (10ayounsi) @papaul, could you have a look at the BIOS of sretest1005 ?  The matching Redfish keys don't exist :( `lang=python >>> dump3.components['N...
[13:50:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:51:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11392041 (10cmooney)
[13:51:48] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) (owner: 10Slyngshede)
[13:51:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[13:52:17] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hcaptcha_proxy: remove unused parameters [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 (owner: 10Kamila Součková)
[13:52:53] <wikibugs>	 (03PS1) 10Jforrester: Undeploy the WikimediaEditorTasks extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954)
[13:54:27] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 (owner: 10Kamila Součková)
[13:54:45] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding roti [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) (owner: 10Slyngshede)
[13:56:53] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková)
[13:56:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[13:57:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:57:53] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski)
[13:57:55] <Lucas_WMDE>	 I won’t be around during the beginning of today’s backport window btw; I might be able to deploy partway through if needed
[13:58:24] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:59:34] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400).
[14:00:04] <jouncebot>	 Daimona, JSherman, and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <JSherman>	 here
[14:00:23] <sukhe>	 multiple changes pending
[14:00:31] <JSherman>	 ack
[14:00:33] <sukhe>	 slyngs: dpogorzelski: ok to merge yours?
[14:00:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:00:44] <Daimona__>	 o/
[14:00:51] <slyngs>	 Go ahead
[14:00:56] <sukhe>	 thanks
[14:01:17] <sukhe>	 dpogorzelski: merging yours as well
[14:01:33] <slyngs>	 I ping -ml for that one. It's probably fine :-)
[14:02:06] <sukhe>	 it's my fault if it wasn't supposed to be merged. usually we blame Fabrizio for anything but he is not around so it's me 
[14:02:55] <wikibugs>	 (03CR) 10AikoChou: "liftwing_streams.yaml is mainly for testing rendering when we update charts, so adding it there is optional but a good practice :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz)
[14:02:56] <dcausse>	 o/
[14:03:13] <wikibugs>	 (03CR) 10Dbrant: [C:03+1] Undeploy the WikimediaEditorTasks extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[14:03:57] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] site.pp: reimage hcaptcha-proxy1001 to proper role [puppet] - 10https://gerrit.wikimedia.org/r/1207165 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[14:05:15] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková)
[14:05:47] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy1001.wikimedia.org with OS bookworm
[14:05:47] <Amir1>	 I can try doing the backport
[14:06:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11392104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[14:06:09] <wikibugs>	 (03CR) 10Kamila Součková: [V:03+2 C:03+2] hcaptcha_proxy: remove unused parameters [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 (owner: 10Kamila Součková)
[14:06:12] <Amir1>	 Daimona: prepare 🍿
[14:06:16] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Enable $wgCampaignEventsEnableContributionTracking in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206964 (https://phabricator.wikimedia.org/T404904) (owner: 10Daimona Eaytoy)
[14:06:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:06:45] <Daimona>	 ready 🤌
[14:07:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:07:04] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableContributionTracking in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206964 (https://phabricator.wikimedia.org/T404904) (owner: 10Daimona Eaytoy)
[14:07:13] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:09:21] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1206964|Enable $wgCampaignEventsEnableContributionTracking in production (T404904)]]
[14:09:26] <stashbot>	 T404904: Release Collaborative Contributions MVP to all wikis with CampaignEvents extension - NOV 20 - https://phabricator.wikimedia.org/T404904
[14:14:57] <logmsgbot>	 !log ladsgroup@deploy2002 daimona, ladsgroup: Backport for [[gerrit:1206964|Enable $wgCampaignEventsEnableContributionTracking in production (T404904)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:15:01] <stashbot>	 T404904: Release Collaborative Contributions MVP to all wikis with CampaignEvents extension - NOV 20 - https://phabricator.wikimedia.org/T404904
[14:15:29] <Amir1>	 Daimona: live in mwdebug
[14:16:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:17:46] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy1001.wikimedia.org with reason: host reimage
[14:21:12] <Amir1>	 Daimona: ping :P
[14:21:17] <jinxer-wm>	 RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:21:26] <Daimona>	 Yeah I was testing, it's hard to do it quickly with popcorn in one hand :P
[14:21:31] <Amir1>	 lol
[14:21:36] <Amir1>	 sorry, I didn't get ACK
[14:21:42] <Amir1>	 I thought you didn't get it
[14:22:02] <jinxer-wm>	 RESOLVED: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:22:05] <Daimona>	 No sorry, I should've said so explicitly, I just thought it'd be easier
[14:22:16] <Daimona>	 Anyway, it seems to be mostly working
[14:22:18] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:22:38] <Daimona>	 The exception being that my edit has not been recorded. Might be that the jobqueue is lagged or something, or perhaps an mwdebug-only thing
[14:23:05] <Daimona>	 Or potentially some cache
[14:23:10] <Daimona>	 I think we can go ahead and I'll test again later
[14:23:25] <Daimona>	 (I don't see any exceptions in logstash either, that's why I'm kinda confident)
[14:23:51] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy1001.wikimedia.org with reason: host reimage
[14:24:11] <logmsgbot>	 !log ladsgroup@deploy2002 daimona, ladsgroup: Continuing with sync
[14:27:31] <wikibugs>	 (03PS3) 10Ayounsi: Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146)
[14:27:54] <wikibugs>	 (03CR) 10Ayounsi: Interface validators: prevent more mistakes on interface naming (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi)
[14:28:15] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206964|Enable $wgCampaignEventsEnableContributionTracking in production (T404904)]] (duration: 18m 53s)
[14:28:19] <stashbot>	 T404904: Release Collaborative Contributions MVP to all wikis with CampaignEvents extension - NOV 20 - https://phabricator.wikimedia.org/T404904
[14:31:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11392187 (10cmooney)
[14:32:01] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse)
[14:32:06] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse)
[14:33:20] <wikibugs>	 (03Merged) 10jenkins-bot: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse)
[14:34:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse)
[14:34:20] <wikibugs>	 (03Merged) 10jenkins-bot: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse)
[14:34:53] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207813|Fix filtering of relevant default sort suggestions (T410602)]], [[gerrit:1207812|Fix filtering of relevant default sort suggestions (T410602)]]
[14:34:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:58] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[14:35:20] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "Please enable new features gradually. First testwikis and small wikis, then bigger number of wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[14:36:20] <logmsgbot>	 !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Robert Timm out of all services on: 2413 hosts
[14:36:41] <JSherman>	 Amir1: not all of these wikis have the extension enabled; this is preparatory work that impacts a community config form
[14:37:15] <JSherman>	 Amir1: we have run comms with impacted wikis already
[14:37:17] <Amir1>	 still better to do it on small wikis first
[14:37:23] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski)
[14:37:25] <Amir1>	 it's not about comms
[14:37:31] <Amir1>	 it's about bugs and issues
[14:37:58] <JSherman>	 please share which wikis you are concerned about
[14:38:53] <Amir1>	 any wiki listed in https://noc.wikimedia.org/conf/highlight.php?file=dblists/large.dblist
[14:40:14] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, dcausse: Backport for [[gerrit:1207813|Fix filtering of relevant default sort suggestions (T410602)]], [[gerrit:1207812|Fix filtering of relevant default sort suggestions (T410602)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:40:18] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[14:40:29] <Amir1>	 dcausse: it's live in mwdebug, can you test?
[14:40:35] <dcausse>	 Amir1: yes, testing
[14:40:41] <Amir1>	 Awesome!
[14:41:01] <dcausse>	 Amir1: all good!
[14:41:20] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, dcausse: Continuing with sync
[14:41:26] <Amir1>	 moving forward \o/
[14:42:03] <JSherman>	 Amir1: I am extremely dissatisfied with this outcome; you are not a listed deployer for this window, and I have the same deploy privileges as you. I could have just self-deployed this
[14:42:19] <JSherman>	 This is a low impact, low risk change
[14:43:11] <JSherman>	 currently enabled wikis are:
[14:43:11] <JSherman>	 	'testwiki'
[14:43:12] <JSherman>	 	'trwiki'
[14:43:12] <JSherman>	 	'idwiki'
[14:43:12] <JSherman>	 	'ukwiki'
[14:43:12] <JSherman>	 	'viwiki'
[14:43:12] <JSherman>	 	'afwiki'
[14:43:13] <JSherman>	 	'bnwiki'
[14:43:13] <JSherman>	 	'azwiki'
[14:43:14] <JSherman>	 	'zhwiki'
[14:43:14] <JSherman>	 	'eswiki'
[14:43:20] <Amir1>	 https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400
[14:43:27] <Amir1>	 > Your patch may or may not be deployed at the sole discretion of the deployer 
[14:43:52] <JSherman>	 Sure, but you just jumped in and named yourself deployer
[14:43:54] <Amir1>	 if you have anyone else from the window willing to deploy this, go ahead
[14:44:01] <JSherman>	 I am willing
[14:44:54] <JSherman>	 deploying
[14:44:55] <wikibugs>	 (03PS1) 10Ssingh: Revert "hcaptcha_proxy: remove unused parameters" [puppet] - 10https://gerrit.wikimedia.org/r/1207873
[14:45:02] <Amir1>	 > Deployer: Lucas (Lucas_WMDE), Martin (Urbanecm), Sammy (TheresNoTime) 
[14:45:19] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207813|Fix filtering of relevant default sort suggestions (T410602)]], [[gerrit:1207812|Fix filtering of relevant default sort suggestions (T410602)]] (duration: 10m 25s)
[14:45:19] <Amir1>	 if any of them are happy with deploying, I have no objection
[14:45:23] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[14:45:40] <dcausse>	 Amir1: thanks for the deploy!
[14:45:45] <Amir1>	 no worries!
[14:45:46] <Daimona>	 [FTR: I tested my change now that it's live and things are working as expected. Thanks Amir1!]
[14:45:52] <Amir1>	 Awesome!
[14:46:38] <JSherman>	 Amir1: how is it that you can say no, but only they can say yes?
[14:47:01] <wikibugs>	 (03PS1) 10Slyngshede: C:tomcat10 hide stacktrace and server info [puppet] - 10https://gerrit.wikimedia.org/r/1207874
[14:47:04] <Amir1>	 no, it's seeking a third party opinion
[14:47:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11392255 (10bking) a:05bking→03None
[14:47:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11392256 (10bking) a:05bking→03None
[14:48:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers hcaptcha1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:48:53] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Revert "hcaptcha_proxy: remove unused parameters" [puppet] - 10https://gerrit.wikimedia.org/r/1207873 (owner: 10Ssingh)
[14:49:03] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "hcaptcha_proxy: remove unused parameters" [puppet] - 10https://gerrit.wikimedia.org/r/1207873 (owner: 10Ssingh)
[14:49:22] <logmsgbot>	 !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php enwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize  # T410602 reindexing search suggestions on enwiki
[14:49:24] <logmsgbot>	 !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php enwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize  # T410602 reindexing search suggestions on enwiki
[14:49:43] <JSherman>	 Amir1: I feel extremely surprised and frustrated with this situation. I have never had an unlisted deployer come in and hold up a deployment at the last minute like this before. But, okay. I'll go take a breath. I do appreciate that you are doing what you think is best.
[14:49:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:03] <_joe_>	 JSherman: frankly, your attitude is unacceptable.
[14:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:51:35] <wikibugs>	 (03PS1) 10Cathal Mooney: lvs1020: move row C vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207877 (https://phabricator.wikimedia.org/T405609)
[14:51:43] <_joe_>	 if a fellow developer, and in this case someone who is actively getting paged for emergencies, expresses doubts about a deployment, it's not ok to try to bulldoze through it.
[14:51:43] <Amir1>	 I am not an unlisted deployer. I'm in the morning list (https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_November_20) and have been deploying for around eight years now. I've taken over since Lucas explicitly said that he can't do it today
[14:51:47] <JSherman>	 Amir1: I apologize for my response to this
[14:52:01] <_joe_>	 Amir1: please avoid further interactions, what needed to be said was said
[14:52:11] <Amir1>	 yeah, fair
[14:52:28] <_joe_>	 JSherman: ok, all good then <4
[14:52:32] <_joe_>	 err *<3
[14:53:37] <logmsgbot>	 !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php frwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize  # T410602 reindexing search suggestions on frwiki
[14:53:42] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[14:53:48] <JSherman>	 I'll go touch grass, I clearly did not handle this well
[14:54:40] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T410589)', diff saved to https://phabricator.wikimedia.org/P85415 and previous config saved to /var/cache/conftool/dbconfig/20251120-145439-ladsgroup.json
[14:54:45] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[14:54:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:54:55] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:22] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy1001.wikimedia.org with OS bookworm
[14:55:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11392286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[14:56:58] <wikibugs>	 (03PS1) 10Ladsgroup: Revert^2 "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878
[14:57:03] <Amir1>	 jouncebot: nowandnext
[14:57:04] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400)
[14:57:04] <jouncebot>	 In 0 hour(s) and 32 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1530)
[15:00:19] <wikibugs>	 (03PS1) 10Blake: Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552)
[15:00:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake)
[15:03:36] <wikibugs>	 (03PS2) 10Blake: Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552)
[15:04:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake)
[15:04:07] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Revert^2 "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup)
[15:04:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup)
[15:05:55] <wikibugs>	 (03PS4) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020)
[15:06:59] <wikibugs>	 (03CR) 10Tiziano Fogli: "Just tested on Pontoon @cwhite@wikimedia.org, sorry for the lag, it took me a while ... I’ve added an inline comment... and thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: 10Cwhite)
[15:08:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:47] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P85416 and previous config saved to /var/cache/conftool/dbconfig/20251120-150946-ladsgroup.json
[15:12:00] <wikibugs>	 (03PS1) 10Sergio Gimeno: EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177)
[15:13:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:14:17] <logmsgbot>	 !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php hewiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize  # T410602 reindexing search suggestions on hewiki
[15:14:21] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[15:14:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:18:29] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup)
[15:18:46] <Lucas_WMDE>	 jouncebot: next
[15:18:46] <jouncebot>	 In 0 hour(s) and 11 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1530)
[15:19:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11392432 (10cmooney)
[15:19:01] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207878|Revert^2 "rdbms: Dismantle concept of groups"]]
[15:19:21] <logmsgbot>	 !log ladsgroup@deploy2002 sync-world failed: <CalledProcessError> Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.RV8eoygq6j']' returned
[15:19:21] <logmsgbot>	 non-zero exit status 255. (scap version: 4.227.0) (duration: 00m 20s)
[15:19:47] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Revert^2 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207884
[15:19:47] <wikibugs>	 (03CR) 10TrainBranchBot: "ladsgroup@deploy2002 created a revert of this change as I12ca322e5f483632714b196ee67c6849ab2bc9d6" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup)
[15:19:55] <Lucas_WMDE>	 :/
[15:20:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207884 (owner: 10TrainBranchBot)
[15:22:46] <Amir1>	 It's okay 😁
[15:23:28] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11392495 (10MatthewVernon)
[15:24:55] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P85417 and previous config saved to /var/cache/conftool/dbconfig/20251120-152454-ladsgroup.json
[15:25:20] <wikibugs>	 (03PS2) 10JMeybohm: fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014)
[15:26:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[15:26:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1530)
[15:30:39] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:31:26] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:32:35] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:33:21] <wikibugs>	 (03PS5) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020)
[15:33:21] <wikibugs>	 (03PS1) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020)
[15:33:24] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[15:34:20] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:34:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert^2 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207884 (owner: 10TrainBranchBot)
[15:34:40] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11392629 (10MatthewVernon)
[15:34:47] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:35:06] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207884|Revert "Revert^2 "rdbms: Dismantle concept of groups""]]
[15:35:11] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[15:35:43] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:36:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11392634 (10cmooney)
[15:36:43] <wikibugs>	 (03PS2) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020)
[15:37:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[15:38:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[15:39:01] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-services: fix ns creation parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207888
[15:39:09] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: fix ns creation parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207888 (owner: 10Dpogorzelski)
[15:39:18] <logmsgbot>	 !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php frwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize  # T410602 reindexing search suggestions on frwiki
[15:39:23] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[15:39:24] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, trainbranchbot: Backport for [[gerrit:1207884|Revert "Revert^2 "rdbms: Dismantle concept of groups""]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:40:02] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, trainbranchbot: Continuing with sync
[15:40:02] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T410589)', diff saved to https://phabricator.wikimedia.org/P85418 and previous config saved to /var/cache/conftool/dbconfig/20251120-154002-ladsgroup.json
[15:40:07] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[15:40:08] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[15:40:19] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T410589)', diff saved to https://phabricator.wikimedia.org/P85419 and previous config saved to /var/cache/conftool/dbconfig/20251120-154014-ladsgroup.json
[15:40:58] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix ns creation parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207888 (owner: 10Dpogorzelski)
[15:43:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[15:44:17] <wikibugs>	 (03PS3) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020)
[15:44:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[15:45:00] <wikibugs>	 (03PS1) 10Cathal Mooney: lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628)
[15:45:16] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207884|Revert "Revert^2 "rdbms: Dismantle concept of groups""]] (duration: 10m 09s)
[15:46:05] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "Makes sense to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[15:46:14] <wikibugs>	 (03PS4) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020)
[15:46:22] <JSherman>	 Amir1: I want to be a little more specific in my apology from earlier; I let my frustration get the best of me and I did not respond to you with humility and an open mind. I was not following the golden rule! I was holding on to the idea that I was right and you were wrong. Sitting in the deployer seat means saying no to things sometimes. I can't go back in time and respond better, but I can do better next time.
[15:46:38] <wikibugs>	 (03PS2) 10Cathal Mooney: lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628)
[15:47:28] <wikibugs>	 (03CR) 10Clare Ming: [C:03+1] EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[15:48:03] <wikibugs>	 (03PS1) 10Bking: opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361)
[15:49:03] <wikibugs>	 (03PS2) 10Bking: opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361)
[15:50:07] <wikibugs>	 (03CR) 10Scott French: [C:03+1] cache::text: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[15:52:39] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:53:20] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:55:20] <wikibugs>	 (03CR) 10Gehel: [C:03+1] opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361) (owner: 10Bking)
[15:55:37] <wikibugs>	 (03PS1) 10Cwhite: loki: remove loki rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619)
[15:55:43] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@4df475f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4df475f3]
[15:55:47] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361) (owner: 10Bking)
[15:56:45] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@4df475f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4df475f3] (duration: 01m 01s)
[15:57:01] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@4df475f]: Regular analytics weekly train [analytics/refinery@4df475f3]
[15:57:17] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[15:57:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] loki: remove loki rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) (owner: 10Cwhite)
[15:58:23] <wikibugs>	 (03PS2) 10Cwhite: loki: remove loki rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619)
[15:58:49] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] kubernetes: Set default Envoy version to 1.32.12 [puppet] - 10https://gerrit.wikimedia.org/r/1207289 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus)
[15:59:26] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@4df475f]: Regular analytics weekly train [analytics/refinery@4df475f3] (duration: 02m 25s)
[16:00:05] <jouncebot>	 brennen and andre: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1600).
[16:00:26] <logmsgbot>	 !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php hewiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize  # T410602 reindexing search suggestions on hewiki
[16:00:30] <stashbot>	 T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602
[16:01:17] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11392787 (10Jhancock.wm) logged in to idrac to check. so far so good. if it doesn't alert by monday, we should be able to close the ticket.
[16:03:26] <wikibugs>	 (03PS1) 10Ssingh: hiera: hcaptcha/proxy: fix healthchecks for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1207896 (https://phabricator.wikimedia.org/T409780)
[16:03:59] <logmsgbot>	 !log mforns@deploy2002 Started deploy [analytics/refinery@4df475f] (thin): Regular analytics weekly train THIN [analytics/refinery@4df475f3]
[16:05:16] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [analytics/refinery@4df475f] (thin): Regular analytics weekly train THIN [analytics/refinery@4df475f3] (duration: 01m 16s)
[16:07:23] <wikibugs>	 (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) (owner: 10Cwhite)
[16:10:54] <wikibugs>	 (03PS1) 10Ssingh: Revert "hcaptcha_proxy: remove unused parameters" [labs/private] - 10https://gerrit.wikimedia.org/r/1207897
[16:12:22] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[16:14:07] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] "PCC LGTM: https://puppet-compiler.wmflabs.org/output/1207894/5335/" [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) (owner: 10Cwhite)
[16:16:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661 (10cmooney) 03NEW p:05Triage→03Medium
[16:16:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661#11392925 (10cmooney)
[16:16:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11392926 (10cmooney)
[16:16:42] <papaul>	 !log rebooting sretest1005 to check LLDP settings
[16:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:09] <icinga-wm>	 PROBLEM - Host sretest1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:18:16] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "This is ready to deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[16:18:18] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[16:18:34] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11392932 (10MatthewVernon)
[16:19:43] <icinga-wm>	 RECOVERY - Host sretest1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[16:22:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11392975 (10Papaul) @ayounsi  sretest1005 is the same as 2004 see below. what you can maybe check is the redfish /IDRAC version on sretest2004 and 1005   {F703...
[16:28:45] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11393052 (10MatthewVernon)
[16:29:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11393076 (10RobH) I've set a gcal event for 2025-12-03 @ 10AM EST / 15:00 GMT for the alert1002 migration.
[16:31:30] <wikibugs>	 (03PS1) 10Ssingh: P:hcaptcha::proxy: do not restart nginx (do reload) [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780)
[16:32:03] <wikibugs>	 07sre-alert-triage, 10Observability-Logging, 06SRE Observability (FY2025/2026-Q2): Alert in need of triage: SystemdUnitCrashLoop (instance grafana2001:9100) - https://phabricator.wikimedia.org/T410619#11393094 (10colewhite) 05Open→03Resolved a:03colewhite I've removed the rsync job that I suspect w...
[16:32:28] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] Revert "hcaptcha_proxy: remove unused parameters" [labs/private] - 10https://gerrit.wikimedia.org/r/1207897 (owner: 10Ssingh)
[16:32:32] <wikibugs>	 (03PS2) 10Stevemunene: LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222)
[16:33:46] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[16:33:58] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7661/console" [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[16:35:42] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1005.eqiad.wmnet
[16:35:42] <wikibugs>	 (03CR) 10Ssingh: "Yeah sorry Ben. I will take care of this today." [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene)
[16:37:36] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for <INSERT USERNAME> - https://phabricator.wikimedia.org/T410667 (10hold_your_horses) 03NEW
[16:37:57] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to GitLab for holdyourhorses - https://phabricator.wikimedia.org/T410667#11393129 (10hold_your_horses)
[16:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:43:09] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet
[16:43:11] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] hiera: hcaptcha/proxy: fix healthchecks for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1207896 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[16:43:46] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[16:46:55] <wikibugs>	 (03PS1) 10Ssingh: site.pp: reimage all new hcaptcha proxies to the right role [puppet] - 10https://gerrit.wikimedia.org/r/1207908
[16:48:03] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[16:48:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: hcaptcha/proxy: fix healthchecks for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1207896 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[16:49:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11393301 (10Clement_Goubert) @KOfori Could you approve this ?
[16:49:09] <wikibugs>	 (03CR) 10Dzahn: gerrit: add a local backup cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[16:49:17] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: add dry run rsync [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[16:49:44] <wikibugs>	 (03CR) 10Dzahn: "thanks, well. let me add Mateus first :)" [puppet] - 10https://gerrit.wikimedia.org/r/1207304 (owner: 10Dzahn)
[16:49:56] <wikibugs>	 (03CR) 10Dzahn: admin: deprecate the releasers-blubber group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn)
[16:50:04] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] P:hcaptcha::proxy: do not restart nginx (do reload) [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[16:50:37] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] site.pp: reimage all new hcaptcha proxies to the right role [puppet] - 10https://gerrit.wikimedia.org/r/1207908 (owner: 10Ssingh)
[16:52:37] <wikibugs>	 (03PS4) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727)
[16:53:11] <icinga-wm>	 PROBLEM - Host sretest1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:41] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1006.eqiad.wmnet
[16:54:53] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::cloudgw: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1207912
[16:54:53] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::cloudgw: Cleanup natlog feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1207913
[16:54:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[16:54:59] <robh>	 !log draining eqiad d3 wikikube hosts for network migration
[16:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:54] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1034.eqiad.wmnet
[16:56:12] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet
[16:56:29] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1034.eqiad.wmnet
[16:56:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393364 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1034.eqiad.wmnet completed: - wikikube-worke...
[16:56:39] <icinga-wm>	 RECOVERY - Host sretest1005 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[16:58:00] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet
[16:58:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393373 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet completed: - wi...
[16:58:15] <wikibugs>	 (03PS2) 10Dzahn: admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313
[16:58:24] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11393374 (10ayounsi) Thanks, yeah that must be the reason : ` >>> spicerack.redfish('sretest1005').hw_model 9 >>> spicerack.redfish('sretest2004').hw_model 9 >...
[16:59:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn)
[17:00:05] <jouncebot>	 jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1700).
[17:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:17] <wikibugs>	 (03PS3) 10Dzahn: admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313
[17:01:48] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet
[17:02:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:03:38] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1159.eqiad.wmnet with reason: C/D Migration
[17:04:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:05:53] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:hcaptcha::proxy: do not restart nginx (do reload) [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[17:06:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:10:02] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1162.eqiad.wmnet with reason: C/D Migration
[17:10:03] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1163.eqiad.wmnet with reason: C/D Migration
[17:10:52] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1034.eqiad.wmnet with reason: C/D Migration
[17:11:34] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] site.pp: reimage all new hcaptcha proxies to the right role [puppet] - 10https://gerrit.wikimedia.org/r/1207908 (owner: 10Ssingh)
[17:11:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:12:16] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1034.eqiad.wmnet
[17:12:19] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1034.eqiad.wmnet
[17:12:23] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet
[17:12:28] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet
[17:12:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393442 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1034.eqiad.wmnet completed: - wikikube-worker1...
[17:12:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393444 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet completed: - wiki...
[17:12:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[17:13:48] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy1002.wikimedia.org with OS bookworm
[17:14:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[17:14:16] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy2001.wikimedia.org with OS bookworm
[17:14:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[17:14:36] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy2002.wikimedia.org with OS bookworm
[17:14:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[17:15:22] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet
[17:15:28] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet
[17:16:05] <robh>	 !log eqiad wikikube d3 repooled, depooling d8 wikikube hosts
[17:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:57] <urbanecm>	 robh: i see some wikikube operations, is it a good idea to do a mediawiki deployment now? or should i wait?
[17:17:14] <robh>	 my understanding is it shouldn't matter but maybe wait until the depool commands complete
[17:17:22] <robh>	 shouldn't be more than 5 minutes
[17:17:32] <robh>	 can ping you when it's done, i see no issue doing a deploy once they're fully depooled
[17:17:42] <robh>	 it's 1 rack out of 8 racks of wikikube heh
[17:17:46] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1207915 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[17:17:58] <urbanecm>	 okay, will do
[17:18:03] <urbanecm>	 ty!
[17:18:14] <robh>	 the waiting for depool is likely me being paranoid but that's part of the job description, thanks for checking!
[17:18:29] <urbanecm>	 given it should be 5 mins, i'll start the CI though
[17:18:35] <robh>	 definitely
[17:18:40] <urbanecm>	 but i'll wait for ping before actually touching prod
[17:18:53] <wikibugs>	 (03PS1) 10Urbanecm: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207919 (https://phabricator.wikimedia.org/T410666)
[17:19:01] <wikibugs>	 (03PS1) 10Urbanecm: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207920 (https://phabricator.wikimedia.org/T410666)
[17:19:08] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207920 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm)
[17:19:12] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207919 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm)
[17:20:08] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet
[17:20:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393479 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet...
[17:20:19] <robh>	 that's 1 of 2 depools done
[17:20:42] <robh>	 i split my command into 2 cuz i didn't feel like making a regex expression 100 characters long ; D
[17:20:57] <robh>	 sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker10[04,19,20,37,67,68,69,70,71,96,97].eqiad.wmnet is long enough =P
[17:20:57] <stashbot>	 T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950
[17:21:42] <robh>	 wikikube-worker1037 is being slow to evict all its pods
[17:22:16] <urbanecm>	 no worries, still waiting :)
[17:22:23] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet
[17:22:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393495 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eq...
[17:22:30] <robh>	 urbanecm: ^ go for it =]
[17:22:35] <robh>	 depools complete
[17:22:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:22:51] <urbanecm>	 ty! have an ETA 6 on CI :)
[17:22:56] <urbanecm>	 do you want me to ping you once i'm done?
[17:23:34] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1167.eqiad.wmnet with reason: C/D Migration
[17:24:34] <wikibugs>	 (03PS5) 10Jsn.sherman: Enable rr-ml AutoModerator CC form on !large wikis Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[17:24:42] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1168.eqiad.wmnet with reason: C/D Migration
[17:25:02] <wikibugs>	 (03PS1) 10Ssingh: hiera: common.yaml: add hcaptcha to all sites [puppet] - 10https://gerrit.wikimedia.org/r/1207922 (https://phabricator.wikimedia.org/T409780)
[17:25:33] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1107.eqiad.wmnet with reason: C/D Migration
[17:26:12] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1107.eqiad.wmnet with reason: C/D Migration
[17:26:27] <wikibugs>	 (03CR) 10Jsn.sherman: "Alrighty, I've removed large wikis with `while read -r db; do composer manage-dblist del $db revertrisk-multilingual; done < dblists/large" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[17:26:42] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1108.eqiad.wmnet with reason: C/D Migration
[17:27:45] <wikibugs>	 (03CR) 10Urbanecm: "question: if the goal is to enable this on all non-large wikis, why is this not a dbexpr instead? that way, you can formulate something li" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[17:27:49] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy1002.wikimedia.org with reason: host reimage
[17:28:31] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1109.eqiad.wmnet with reason: C/D Migration
[17:29:23] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1110.eqiad.wmnet with reason: C/D Migration
[17:30:35] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1004.eqiad.wmnet with reason: C/D Migration
[17:31:52] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1019.eqiad.wmnet with reason: C/D Migration
[17:32:03] <wikibugs>	 (03Merged) 10jenkins-bot: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207920 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm)
[17:32:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:32:52] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1096.eqiad.wmnet with reason: C/D Migration
[17:32:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:33:01] <wikibugs>	 (03Merged) 10jenkins-bot: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207919 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm)
[17:33:27] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy2002.wikimedia.org with reason: host reimage
[17:33:29] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy1002.wikimedia.org with reason: host reimage
[17:33:30] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy2001.wikimedia.org with reason: host reimage
[17:34:09] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1097.eqiad.wmnet with reason: C/D Migration
[17:36:38] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1020.eqiad.wmnet with reason: C/D Migration
[17:36:42] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1207920|hotfix: Disable Urdu alias for Special:Homepage (T410666)]], [[gerrit:1207919|hotfix: Disable Urdu alias for Special:Homepage (T410666)]]
[17:36:46] <stashbot>	 T410666: Visiting Special:Homepage is not recognised on urwiki - https://phabricator.wikimedia.org/T410666
[17:36:51] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks for highlighting that enabling the downstream connection limit is new." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203194 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus)
[17:37:15] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy2002.wikimedia.org with reason: host reimage
[17:37:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:37:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:38:39] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1037.eqiad.wmnet with reason: C/D Migration
[17:39:42] <wikibugs>	 (03CR) 10CDanis: Hive: alert when query rate is too high (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel)
[17:39:59] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1164.eqiad.wmnet with reason: C/D Migration
[17:41:03] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy2001.wikimedia.org with reason: host reimage
[17:42:13] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1165.eqiad.wmnet with reason: C/D Migration
[17:42:23] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1067.eqiad.wmnet with reason: C/D Migration
[17:42:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:44:01] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1068.eqiad.wmnet with reason: C/D Migration
[17:44:22] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1069.eqiad.wmnet with reason: C/D Migration
[17:45:16] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1070.eqiad.wmnet with reason: C/D Migration
[17:45:47] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1071.eqiad.wmnet with reason: C/D Migration
[17:45:50] <wikibugs>	 06SRE, 10Phabricator: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent - https://phabricator.wikimedia.org/T410572#11393654 (10Volans) Adding #collaboration-services
[17:46:11] <wikibugs>	 (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438)
[17:46:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:48:54] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Pleasantly surprised to see the stock admin config already adds `ignore_global_conn_limit`, so you don't have to deal with that here." [puppet] - 10https://gerrit.wikimedia.org/r/1203195 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus)
[17:49:01] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy1002.wikimedia.org with OS bookworm
[17:49:13] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393663 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[17:50:09] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks for updating this!" [puppet] - 10https://gerrit.wikimedia.org/r/1207844 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm)
[17:50:50] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet
[17:50:51] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet
[17:50:57] <robh>	 !log wikikube migrations in eqiad complete, repooling d8 
[17:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:59] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet
[17:51:01] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet
[17:51:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393676 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet co...
[17:51:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393677 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqia...
[17:51:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:53:13] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aux-k8s-worker1007.eqiad.wmnet with reason: C/D Migration
[17:53:40] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy2002.wikimedia.org with OS bookworm
[17:53:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393679 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[17:54:03] <swfrench-wmf>	 jouncebot: nowandnext
[17:54:04] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1700)
[17:54:04] <jouncebot>	 In 0 hour(s) and 5 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800)
[17:54:04] <jouncebot>	 In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800)
[17:54:48] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aux-k8s-worker1006.eqiad.wmnet with reason: C/D Migration
[17:55:00] <urbanecm>	 swfrench-wmf: note i am (still) running scap :-/ got hit by the full image build :-/
[17:56:24] <swfrench-wmf>	 urbanecm: ah, thanks for the heads-up! my changes should be fairly quick (i.e., don't require the full window), so just keep me posted :)
[17:56:36] <urbanecm>	 swfrench-wmf: i'll ping you once done!
[17:58:15] <wikibugs>	 (03PS4) 10Bking: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse)
[17:58:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.mysql.parsercache
[17:58:31] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[17:59:34] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy2001.wikimedia.org with OS bookworm
[17:59:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[17:59:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[18:00:05] <jouncebot>	 bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800).
[18:00:05] <jouncebot>	 swfrench-wmf: May I have your attention please! MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800)
[18:01:06] <swfrench-wmf>	 FYI, holding for now. I'll be deploying scap and then deploying with said scap :)
[18:02:22] <bd808>	 I don't have anything to deploy this week swfrench-wmf
[18:04:34] <swfrench-wmf>	 bd808: ack, thanks!
[18:04:50] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1207920|hotfix: Disable Urdu alias for Special:Homepage (T410666)]], [[gerrit:1207919|hotfix: Disable Urdu alias for Special:Homepage (T410666)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:04:54] <stashbot>	 T410666: Visiting Special:Homepage is not recognised on urwiki - https://phabricator.wikimedia.org/T410666
[18:05:26] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[18:05:33] <urbanecm>	 patch "fixes" the problem, proceeding
[18:05:35] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11393717 (10RobH) Update:  @Ladsgroup had other things going on and wasn't able to do this today but did link me to the directions on how to depool: https:...
[18:05:55] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: common.yaml: add hcaptcha to all sites [puppet] - 10https://gerrit.wikimedia.org/r/1207922 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[18:08:54] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3001.wikimedia.org with OS bookworm
[18:09:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[18:09:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393719 (10RobH) Please note all wikikube workers have been migrated and we're now down to only 4 hosts left with #serviceops to migrate:  wikikube-ctrl1003 kafka-main1008 kafk...
[18:09:26] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS bookworm
[18:09:30] <stashbot>	 sukhe@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[18:09:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[18:09:47] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4001.wikimedia.org with OS bookworm
[18:09:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[18:10:02] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11393725 (10Marostegui) @robh I'm out and not near a keyboard but you have to replace pc1016 with pc6
[18:11:17] <wikibugs>	 (03PS1) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[18:11:24] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4002.wikimedia.org with OS bookworm
[18:11:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393740 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[18:11:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:11:52] <wikibugs>	 (03PS2) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[18:12:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:12:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[18:12:42] <wikibugs>	 (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7662/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:12:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:12:53] <wikibugs>	 (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438)
[18:13:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle)
[18:13:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11393756 (10RobH) Day 8 Update:  * 22 hosts moved today, 22 remain ** all wikikube and aux host migrations completed ** (3) pc hosts in discussion with data-p...
[18:13:55] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5001.wikimedia.org with OS bookworm
[18:14:03] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5002.wikimedia.org with OS bookworm
[18:14:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[18:14:17] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393761 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[18:15:15] <wikibugs>	 (03CR) 10Scott French: [C:03+1] fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm)
[18:16:10] <wikibugs>	 (03PS3) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[18:16:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:16:58] <wikibugs>	 (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7663/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:17:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11393787 (10RobH) 05Open→03Resolved All #infrastructure-foundations hosts in eqiad c/d rows migrated to the new switch stacks.
[18:17:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:18:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:18:39] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207920|hotfix: Disable Urdu alias for Special:Homepage (T410666)]], [[gerrit:1207919|hotfix: Disable Urdu alias for Special:Homepage (T410666)]] (duration: 41m 57s)
[18:18:43] <stashbot>	 T410666: Visiting Special:Homepage is not recognised on urwiki - https://phabricator.wikimedia.org/T410666
[18:18:48] <sukhe>	 yeah the drmrs one I take responsibility for. reimage happening soon. non-prod host, completely fine to ignore
[18:18:51] <urbanecm>	 swfrench-wmf: (finally) done!
[18:19:10] <swfrench-wmf>	 urbanecm: thank you!
[18:21:47] <sukhe>	 !log sudo cumin 'A:lvs-eqiad or A:lvs-codfw' 'disable-puppet "set druid-coordinator to state lvs_setup"'
[18:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:14] <logmsgbot>	 !log swfrench@deploy2002 Installing scap version "4.228.0" for 2 host(s)
[18:22:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:22:53] <wikibugs>	 (03PS4) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[18:23:14] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene)
[18:23:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:23:42] <wikibugs>	 (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7664/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:24:01] <logmsgbot>	 !log swfrench@deploy2002 Installation of scap version "4.228.0" completed for 2 hosts
[18:24:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[18:25:46] <wikibugs>	 (03PS5) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[18:26:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:26:32] <wikibugs>	 (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7665/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[18:26:42] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run after switching scap mwscript to PHP 8.3 - T405955
[18:26:46] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:27:00] <sukhe>	 !log sukhe@lvs1020:~$ sudo systemctl restart pybal.service: T406222
[18:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:04] <stashbot>	 T406222: Add druid coordinator service to LVS for the druid_public cluster. - https://phabricator.wikimedia.org/T406222
[18:27:14] <logmsgbot>	 !log swfrench@deploy2002 Stopping before sync operations
[18:27:44] <wikibugs>	 (03PS2) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438)
[18:28:46] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Normal scap run after switching scap mwscript to PHP 8.3 - T405955
[18:29:07] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:29:13] <sukhe>	 ok
[18:29:18] <sukhe>	 that is me, looking
[18:29:27] <sukhe>	 this change https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/195335854fd83138604462f54264e0de4b3c8daf
[18:29:35] <sukhe>	 puppet is disabled everywhere else, so no worries
[18:29:40] <sukhe>	 (on LVS'es)
[18:29:57] * swfrench-wmf thumbs up
[18:29:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[18:33:35] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: host reimage
[18:33:50] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3002.wikimedia.org with reason: host reimage
[18:33:58] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: host reimage
[18:33:59] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage
[18:34:20] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Normal scap run after switching scap mwscript to PHP 8.3 - T405955 (duration: 05m 34s)
[18:34:25] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:34:34] <sukhe>	 weird, the service is definitely known to IPVS
[18:34:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[18:35:03] <wikibugs>	 (03PS3) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438)
[18:35:58] <sukhe>	 TCP  10.2.2.15:8081 mh (mh-port)
[18:36:52] <wikibugs>	 (03PS5) 10Bking: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse)
[18:37:04] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:37:18] <sukhe>	 ^ yeah that's fine, will be reimaged soon, non prod
[18:37:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:38:51] <wikibugs>	 (03PS2) 10Scott French: De-configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955)
[18:39:41] <wikibugs>	 (03PS2) 10Ahmon Dancy: scap.cfg.erb: Restore beta cluster php_fpm_restart_script setting [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166)
[18:39:51] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: host reimage
[18:39:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[18:40:00] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[18:42:04] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:42:35] <sukhe>	 ah I see what's happening with the LVS change
[18:42:36] <sukhe>	 ok reverting
[18:42:40] <wikibugs>	 (03CR) 10Bearloga: [C:03+2] EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[18:43:25] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[18:44:13] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3002.wikimedia.org with reason: host reimage
[18:46:06] <wikibugs>	 (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438)
[18:46:34] <wikibugs>	 (03PS1) 10Ssingh: Revert "LVS: set druid-coordinator to state lvs_setup" [puppet] - 10https://gerrit.wikimedia.org/r/1207933
[18:46:51] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage
[18:49:44] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: host reimage
[18:51:08] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:51:52] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:52:42] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:52:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:52:52] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:55:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:58:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:59:46] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4001.wikimedia.org with OS bookworm
[18:59:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:00:04] <jouncebot>	 brennen and andre: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1900).
[19:00:36] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5001.wikimedia.org with reason: host reimage
[19:00:49] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5002.wikimedia.org with reason: host reimage
[19:01:10] <wikibugs>	 (03PS6) 10Bking: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse)
[19:01:58] <andre>	 brennen: FYI T409743 seems to have gotten sorted out already
[19:01:58] <stashbot>	 T409743: English Wikibooks main page subpages under cascading protection are editable by anyone, and MP stylesheets do not display protection messages to non-admins - https://phabricator.wikimedia.org/T409743
[19:02:09] <brennen>	 o/
[19:02:13] <brennen>	 andre: thx
[19:02:27] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3002.wikimedia.org with OS bookworm
[19:02:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:03:13] <wikibugs>	 (03CR) 10Jsn.sherman: "@murbanec@wikimedia.org that sounds like a good solution, but I'm loath to use set logic in this repo unless I am really clear on how it w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[19:03:24] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:03:40] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5001.wikimedia.org with reason: host reimage
[19:04:03] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] "From my read of https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/master/modules/ext.wikimediaEvents/phpEngine.js I li" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:04:15] <wikibugs>	 (03CR) 10Jsn.sherman: "> the plan is to add this config to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[19:04:44] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3001.wikimedia.org with OS bookworm
[19:04:52] <brennen>	 !log 1.46.0-wmf.3 train status (T408273): no current blockers, rolling to all wikis
[19:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:57] <stashbot>	 T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273
[19:05:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:05:25] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207945 (https://phabricator.wikimedia.org/T408273)
[19:05:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207945 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot)
[19:06:21] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207945 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot)
[19:07:27] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5002.wikimedia.org with reason: host reimage
[19:08:35] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4002.wikimedia.org with OS bookworm
[19:08:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393880 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:09:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:11:30] <wikibugs>	 (03PS2) 10Aaron Schulz: Mark non-wikimedia.org math APIs as deprecated in the sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773)
[19:16:28] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy6001.wikimedia.org with OS bookworm
[19:16:31] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy6002.wikimedia.org with OS bookworm
[19:16:35] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.3  refs T408273
[19:16:36] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7001.wikimedia.org with OS bookworm
[19:16:40] <stashbot>	 T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273
[19:16:41] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS bookworm
[19:16:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393895 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:16:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393896 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:16:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:16:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393899 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:19:15] <wikibugs>	 (03PS1) 10Brennen Bearnes: Do not pass callback arguments to incompatible method [extensions/GlobalPreferences] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207950 (https://phabricator.wikimedia.org/T410551)
[19:20:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393903 (10Scott_French) @RobH - Confirming conf1009 for 2025-11-24, but the SRE staff meeting runs from 17:00 - 18:00 UTC. I'd suggest starting no later than 16:30 (and pausin...
[19:21:47] <wikibugs>	 (03PS6) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[19:22:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393911 (10RobH) >>! In T405950#11393903, @Scott_French wrote: > @RobH - Confirming conf1009 for 2025-11-24, but the SRE staff meeting runs from 17:00 - 18:00 UTC. I'd suggest...
[19:22:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:23:24] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:24:56] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5001.wikimedia.org with OS bookworm
[19:25:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:27:45] <jinxer-wm>	 RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:28:24] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:29:04] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5002.wikimedia.org with OS bookworm
[19:29:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:32:22] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "LVS: set druid-coordinator to state lvs_setup" [puppet] - 10https://gerrit.wikimedia.org/r/1207933 (owner: 10Ssingh)
[19:32:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:33:24] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:35:26] <sukhe>	 !log sukhe@lvs1020:~$ sudo systemctl restart pybal.service 
[19:35:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393935 (10Scott_French) 18:15 UTC sounds good to me. Thank you!
[19:37:00] <sukhe>	 !log sudo cumin 'A:lvs-eqiad or A:lvs-codfw' 'run-puppet-agent --enable "set druid-coordinator to state lvs_setup"'
[19:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:40:54] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:43:01] <swfrench-wmf>	 brennen: once the train is clear and logs are clean, would it be alright if I sneak in a backport that didn't fit into the infra window earlier this morning?
[19:43:43] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy6001.wikimedia.org with reason: host reimage
[19:43:45] <brennen>	 swfrench-wmf: go ahead - things are looking ok now.
[19:43:58] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage
[19:44:01] <swfrench-wmf>	 brennen: great, thank you!
[19:44:04] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage
[19:44:20] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy6002.wikimedia.org with reason: host reimage
[19:45:12] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse)
[19:46:06] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:46:14] <wikibugs>	 (03CR) 10Bking: [C:03+2] apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse)
[19:46:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:47:16] <wikibugs>	 (03Merged) 10jenkins-bot: De-configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:47:36] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1204948|De-configure cookie-based enrollment in PHP 8.3 (T405955)]]
[19:47:41] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[19:47:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:48:47] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy6001.wikimedia.org with reason: host reimage
[19:52:12] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1204948|De-configure cookie-based enrollment in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:52:44] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage
[19:53:35] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[19:53:57] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[19:56:11] <Reedy>	 jouncebot: nowandnext
[19:56:11] <jouncebot>	 For the next 1 hour(s) and 3 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1900)
[19:56:11] <jouncebot>	 In 1 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2100)
[19:56:30] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage
[19:56:35] <wikibugs>	 (03PS1) 10Reedy: AccountRecovery: Log more data for account recovery submissions [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207963
[19:57:39] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204948|De-configure cookie-based enrollment in PHP 8.3 (T405955)]] (duration: 10m 03s)
[19:57:44] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[19:59:48] <wikibugs>	 (03CR) 10Reedy: [C:03+2] AccountRecovery: Log more data for account recovery submissions [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207963 (owner: 10Reedy)
[20:00:53] <Reedy>	 swfrench-wmf: Do you need to deploy anything else?
[20:01:00] <swfrench-wmf>	 Reedy: all done!
[20:01:05] <wikibugs>	 (03Merged) 10jenkins-bot: AccountRecovery: Log more data for account recovery submissions [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207963 (owner: 10Reedy)
[20:01:05] <Reedy>	 cool, cheers
[20:01:59] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1207963|AccountRecovery: Log more data for account recovery submissions]]
[20:03:05] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T410589)', diff saved to https://phabricator.wikimedia.org/P85422 and previous config saved to /var/cache/conftool/dbconfig/20251120-200304-ladsgroup.json
[20:03:09] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[20:03:16] <wikibugs>	 (03CR) 10Dzahn: "using this gerrit patch also for open discussion what to do with it" [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn)
[20:04:14] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy6002.wikimedia.org with reason: host reimage
[20:06:17] <logmsgbot>	 !log reedy@deploy2002 reedy: Backport for [[gerrit:1207963|AccountRecovery: Log more data for account recovery submissions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:07:19] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy6001.wikimedia.org with OS bookworm
[20:07:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394023 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:07:40] <logmsgbot>	 !log reedy@deploy2002 reedy: Continuing with sync
[20:08:09] <wikibugs>	 (03PS1) 10Dzahn: admin: remove fisch-wmde from releasers-wikidiff2 [puppet] - 10https://gerrit.wikimedia.org/r/1207967
[20:08:39] <wikibugs>	 (03PS2) 10Dzahn: admin: remove wmde-fisch from releasers-wikidiff2 [puppet] - 10https://gerrit.wikimedia.org/r/1207967
[20:11:41] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207963|AccountRecovery: Log more data for account recovery submissions]] (duration: 09m 42s)
[20:12:33] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7002.wikimedia.org with OS bookworm
[20:12:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:14:23] <wikibugs>	 (03CR) 10Aklapper: [C:03+1] "Looking at the list of four repos on https://phabricator.wikimedia.org/diffusion/query/6WUBLfM9eS2R/ , is it an intentional decision that " [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn)
[20:14:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:32] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7001.wikimedia.org with OS bookworm
[20:15:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:16:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "don't have too much background here but merging per "beta only"" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[20:17:15] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks dzahn!" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[20:18:08] <wikibugs>	 (03CR) 10Dzahn: "arrg, I am sorry, I am duplicating my own existing patch from the past" [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn)
[20:18:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P85423 and previous config saved to /var/cache/conftool/dbconfig/20251120-201812-ladsgroup.json
[20:18:24] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:22:26] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11394056 (10Ladsgroup) Very likely a popular gadget/css hardcoding the url. I investigate once I get my hands on a PC
[20:24:18] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy6002.wikimedia.org with OS bookworm
[20:24:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:27:01] <wikibugs>	 (03CR) 10Dzahn: "please still feel asked to review but move the discussion over to existing comments from Krinkle -> https://gerrit.wikimedia.org/r/c/opera" [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn)
[20:27:08] <wikibugs>	 (03Abandoned) 10Dzahn: switch historic Subversion URLs from Phabricator to static-codereview [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn)
[20:27:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394058 (10ssingh) 05Open→03Resolved a:03ssingh Once https://ger...
[20:33:20] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P85424 and previous config saved to /var/cache/conftool/dbconfig/20251120-203320-ladsgroup.json
[20:36:34] <wikibugs>	 (03PS1) 10Ssingh: hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780)
[20:37:24] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7666/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[20:37:42] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:04-2] "Do not merge." [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[20:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:39:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:44:51] <wikibugs>	 (03PS1) 10Scott French: deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955)
[20:44:53] <wikibugs>	 (03PS1) 10Scott French: deployment_server: switch mw-script/main to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1207980 (https://phabricator.wikimedia.org/T405955)
[20:44:54] <wikibugs>	 (03PS1) 10Scott French: deployment_server: follow main release in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955)
[20:46:12] <wikibugs>	 (03PS2) 10Scott French: deployment_server: switch mw-script/main to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1207980 (https://phabricator.wikimedia.org/T405955)
[20:46:13] <wikibugs>	 (03PS2) 10Scott French: deployment_server: follow main release in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955)
[20:47:07] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[20:48:28] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T410589)', diff saved to https://phabricator.wikimedia.org/P85425 and previous config saved to /var/cache/conftool/dbconfig/20251120-204827-ladsgroup.json
[20:48:32] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[20:48:44] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[20:48:52] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T410589)', diff saved to https://phabricator.wikimedia.org/P85426 and previous config saved to /var/cache/conftool/dbconfig/20251120-204852-ladsgroup.json
[20:54:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[20:59:46] <bd808>	 jouncebot: nowandnext
[20:59:46] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1900)
[20:59:46] <jouncebot>	 In 0 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2100)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2100).
[21:00:05] <jouncebot>	 bvibber and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:09] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: allow rate limits per time unit (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler)
[21:00:11] <bvibber>	 o/
[21:00:20] <James_F>	 Hey.
[21:00:23] <bvibber>	 i can spiderpig one or both in a pinch
[21:00:29] <James_F>	 Sure, go for it.
[21:00:35] <bvibber>	 woo
[21:00:36] <James_F>	 My one is nominally trivial.
[21:00:43] <James_F>	 (He says…)
[21:00:45] <bvibber>	 mine is a one-character change :D
[21:00:54] <James_F>	 Mine is a one-extension change. ;-)
[21:01:02] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:01:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[21:01:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[21:02:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:02:06] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) (owner: 10Slyngshede)
[21:02:37] <wikibugs>	 (03Merged) 10jenkins-bot: Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[21:02:40] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy the WikimediaEditorTasks extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[21:03:00] <logmsgbot>	 !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1207273|Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps (T372165)]], [[gerrit:1207865|Undeploy the WikimediaEditorTasks extension (T376954)]]
[21:03:06] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:03:07] <stashbot>	 T376954: Stop using and then undeploy the WikimediaEditorTasks extension - https://phabricator.wikimedia.org/T376954
[21:04:03] <bvibber>	 hmm, i wonder if removing an extension will trigger a slow localization cache sync actually :D
[21:04:11] <bvibber>	 no worries, perfect time for it to be slow
[21:04:31] <James_F>	 Yeah, sorry, didn't think of that. 
[21:04:43] <James_F>	 OTOH, your patch is beta-only, so you're not waiting for the sync anyway.
[21:04:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:06:02] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:08:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11394161 (10Jclark-ctr) a:05LSobanski→03RobH
[21:09:09] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:09:21] <bd808>	 As annoying as it is, I think it is nice that a slow deploy today takes like 20 minutes instead of the 60 minutes that a full scap with l10n rebuild once took. We are getting better in increments. :)
[21:10:01] <James_F>	 True.
[21:10:08] <bd808>	 there is still a lot to shake fists at though. 
[21:10:10] <James_F>	 Also `21:04:09 Finished l10n-update (duration: 01m 06s)`.
[21:10:26] <James_F>	 Though the images deltas will be big, of course.
[21:10:40] <James_F>	 The load-i18n-from-JSON work would be nice
[21:10:51] <bvibber>	 whee
[21:11:26] <bvibber>	 we just need to put the localization cache in a redis cluster and put a front-end cache on it
[21:11:35] <bd808>	 I'm not sure what at this point would be the "better" replacement for the CDB files.
[21:11:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:11:52] <James_F>	 I'm not sure an LRU cache is ideal for a pan-lingual message cache. :-)
[21:12:37] <bd808>	 if we can afford to externalize it all that would be nice. having l10noid or something as the magic fast external, shared cache
[21:12:44] <taavi>	 separate the l10n cache from mediawiki deployments by introducing a new localisatoid microservice
[21:12:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:13:02] <rzl>	 call it the remotisation cache
[21:13:16] <bd808>	 there are some seriously hot paths for l10n lookup
[21:13:22] <James_F>	 Yes.
[21:13:55] <James_F>	 So hot that I'd not want it to go over the k8s service boundary.
[21:14:02] <James_F>	 Maybe a sidecar? Eh.
[21:14:31] <bvibber>	 having an efficiently queryable on-device database makes a lot of sense for our use case where shit is extensively using these lookups serially during long page generations :D
[21:14:40] <bvibber>	 but updating those databases efficiently seems hard
[21:14:59] <bd808>	 as I understand it, one of the slow parts today is bundling all of the messages in a given language into a big blob. The json stuff we did to make rsync for that faster doesn't help with adding to a Docker container's layer size
[21:15:27] <bd808>	 the slow part is not really the bundling, but what it does to the layer delta in the image
[21:15:58] <bd808>	 and adding 1 en message without translation touches every other language
[21:16:09] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11394186 (10Dzahn) a:03OKryva-WMF Hi Ollie, let us know how it's going. Cheers. -- Daniel
[21:16:13] <bd808>	 (I think)
[21:16:13] <bvibber>	 i feel like append-only updates to the files is what we really want? but that requires depending on the previous build output to make a new build
[21:16:17] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11394188 (10Dzahn) p:05Triage→03Medium
[21:17:06] <bd808>	 In the docker containers our deltas are at a file level rather than a line level
[21:17:21] <bvibber>	 oof
[21:17:40] <James_F>	 Yes, we take per-language-sparse-json i18n files, build them into per-language-complete-cdb files, and write them into the docker image alongside the actual code as a relatively low layer (I think?); switching to per-language-complete-json files without changing the docker build step won't help much, but might be a smidge faster to build in scap and load inside MW. The main value is the… yes.
[21:17:57] <James_F>	 Speaking of which, docker image build from scap is still on-going, 14 minutes later. Sigh.
[21:18:26] <bvibber>	 xkcd 303
[21:18:40] * bd808 was trying to keep y'all occupied so you wouldn't worry about the wall clock time ;)
[21:18:45] <James_F>	 Indeed.
[21:19:01] <James_F>	 I can do code review with an eye open for spiderpig and IRC. :-)
[21:19:04] <bvibber>	 hehe
[21:19:58] <bd808>	 getting to single version containers will help some I think
[21:20:20] <James_F>	 Yes, if nothing else, half the bytes shipped.
[21:20:24] <dancy>	 bvibber: The incremental image build process that we use does in fact require access to the previous build, so that is not an unreasonable requirement.
[21:21:01] <bvibber>	 oooh spiffy
[21:21:49] <bd808>	 there is still a 5 minute wait penalty on large image uploads into the container registry to work around a bug in the swift storage backend
[21:22:24] <James_F>	 Oh, yeah. Will the smaller images from single-version-containers help with that?
[21:22:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:23:25] <bd808>	 I think so? We trigger the wait on the image size delta if I am remembering correctly
[21:23:38] <bd808>	 dancy has done neat and smart things in the incremental builds.
[21:24:13] <bd808>	 > Finished build-and-push-container-images (duration: 19m 48s)
[21:24:35] <James_F>	 Hurrah. Finally.
[21:24:52] <James_F>	 I do miss the sub-60-seconds config deploys I used to do back in the day.
[21:25:11] <James_F>	 Pre-linting, pre-canaries, pre-everything. But gosh was it fast.
[21:25:16] <bvibber>	 hehe
[21:25:18] <bd808>	 yeah, file sync was nice for some stuff for sure
[21:25:33] <James_F>	 But also allowed us to take down production with arrray().
[21:25:40] <James_F>	 So… let's not regress.
[21:25:58] <bd808>	 I'm sure bvibber can tell of the days when it was as easy as using vim on the nfs server :)
[21:26:24] <bvibber>	 oh yeah i remember straight up debugging editing files on nfs raw
[21:26:33] <bvibber>	 add some more printfs
[21:26:35] <bvibber>	 :D
[21:26:56] <James_F>	 Fun times.
[21:26:57] <bd808>	 I was still doing that for wikitech up to like 2023 :)
[21:27:32] <James_F>	 I mean, there were some patches that could only be deployed by me making manual changes to mediawiki-staging and syncing them bit by bit.
[21:27:36] * James_F shudders.
[21:28:10] <Amir1>	 bvibber: I think this is a labs-only patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207273
[21:28:29] <James_F>	 Amir1: Yes, it was.
[21:28:34] <Amir1>	 don't we need something for production, or is it somewhere and I missed it? :D
[21:28:45] <Amir1>	 if it's intentional, then go ahead
[21:28:57] <bd808>	 Welcome back to another episode of Everything sucks today, but wait until you hear how bad it used to suck with your hosts bvibber, James_F, and bd808 
[21:29:10] <logmsgbot>	 !log bvibber@deploy2002 bvibber, jforrester: Backport for [[gerrit:1207273|Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps (T372165)]], [[gerrit:1207865|Undeploy the WikimediaEditorTasks extension (T376954)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:29:16] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:29:16] <stashbot>	 T376954: Stop using and then undeploy the WikimediaEditorTasks extension - https://phabricator.wikimedia.org/T376954
[21:29:17] <bvibber>	 Amir1: aaaaaaagh you're right
[21:29:19] <James_F>	 bd808: I'd listen to that podcast.
[21:29:24] <bvibber>	 i'll fix that.... when this is done
[21:29:38] <Amir1>	 would you sell energy drinks too? I'd buy
[21:29:50] <Reedy>	 Amir1: what about an NFT?
[21:30:03] <Reedy>	 Some ponzi scheme crypto?
[21:30:04] <bd808>	 I would like a pain reliever as one of the sponsors
[21:30:08] <James_F>	 bvibber: Good to proceed at my end.
[21:30:22] <logmsgbot>	 !log bvibber@deploy2002 bvibber, jforrester: Continuing with sync
[21:31:11] <Amir1>	 NFS -> NFT, not that different 
[21:31:22] <bd808>	 James_F: We actually could do a live version at the next hackathon... special guests Reedy and Amir1 to add more commentary.
[21:31:25] <Reedy>	 Makes you regret your life choices?
[21:31:31] <wikibugs>	 (03PS1) 10Bvibber: Fix wgMediaViewerThumbnailBucketSizes on prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208003 (https://phabricator.wikimedia.org/T372165)
[21:31:58] <wikibugs>	 (03PS9) 10Cwhite: prometheus: split targets into directories by source [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223)
[21:32:25] <bvibber>	 shit maybe we should do a CDB replacement project at hackathon
[21:32:38] <Reedy>	 I hear AI has a great condensed format to replace json
[21:32:40] <wikibugs>	 (03CR) 10Cwhite: prometheus: split targets into directories by source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: 10Cwhite)
[21:33:17] <James_F>	 bvibber: Sounds fun!
[21:33:23] <brennen>	 the amount of "why is this taking so long" that could address
[21:33:43] <dancy>	 That would be great
[21:34:26] <James_F>	 Ideally someone from RelEng would be there, rather than us just hacking on their day job and making a mess.
[21:34:41] <bvibber>	 "how hard can an append-only database with efficient lookups be"
[21:34:41] <bvibber>	 R.I.P. Brooke Vibber, 1978-2025 died from reading too many computer science papers
[21:35:13] <bd808>	 single version containers + l10n in php + op code cache
[21:35:30] <bvibber>	 actually i wonder about the relative performance of sqlite
[21:35:43] <bvibber>	 probably too slow though
[21:35:49] <wikibugs>	 (03PS1) 10Reedy: AccountRecovery: Allow temp users to access Special:AccountRecovery [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208005
[21:35:59] <dancy>	 Q: Isn't the l10n data stored in memcache or something like that?
[21:36:26] <bd808>	 sqlite should be testable, but I'd ask Tim first if he already tried it
[21:37:05] <bd808>	 dancy: you can configure for that, but CDB is faster
[21:37:13] <tappof>	 !log /srv/thanos-store cleanup on titan2001 (start)
[21:37:15] <dancy>	 I see.
[21:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:40] <wikibugs>	 (03CR) 10Bking: [C:03+2] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson)
[21:37:43] <bd808>	 no network latency and really fast indexing
[21:37:57] <dancy>	 Maybe related question: What does the purgeMessageBlobStore.php maintenance script do?
[21:37:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:38:15] <dancy>	 and does it have any relevant for the k8s deployments.
[21:38:20] <dancy>	 *relevance
[21:38:28] <wikibugs>	 (03CR) 10Dzahn: "to clarify: the review request was not about checking every line of code. it was meant to be like "hey, is this the right place and idea t" [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb)
[21:38:34] <Reedy>	 Clears some ResourceLoader related message blobs
[21:38:37] <dancy>	 scap sync-world runs it at the end of each deployment.
[21:39:12] <James_F>	 I vaguely think this is still needed.
[21:39:14] <Reedy>	 $cache->touchCheckKey( self::makeGlobalPurgeKey( $cache ), $cache::HOLDOFF_TTL_NONE );
[21:39:16] <bd808>	 MessageBlobStore is the cache of 10n strings for javascript, right?
[21:39:26] <bd808>	 *l10n
[21:39:33] <James_F>	 Yes.
[21:39:37] <dancy>	 ah
[21:39:51] <James_F>	 And it's persistent between MW images.
[21:39:51] <wikibugs>	 (03CR) 10Reedy: [C:03+2] AccountRecovery: Allow temp users to access Special:AccountRecovery [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208005 (owner: 10Reedy)
[21:40:10] <James_F>	 bvibber: It turns out Reedy is deploying over your patch. ;-)
[21:40:17] <Reedy>	 Ci-ing
[21:40:19] <bvibber>	 hah
[21:40:56] <wikibugs>	 (03Merged) 10jenkins-bot: AccountRecovery: Allow temp users to access Special:AccountRecovery [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208005 (owner: 10Reedy)
[21:40:59] <wikibugs>	 (03CR) 10Dzahn: "@aklapper@wikimedia.org That is a good question. I don't have the answer I think. If you don't mind could you repeat that maybe on the dup" [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn)
[21:41:17] <wikibugs>	 (03PS1) 10Scott French: deployment_server: switch deployment hosts to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208006 (https://phabricator.wikimedia.org/T405955)
[21:41:17] <wikibugs>	 (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208006 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[21:41:29] <dancy>	 I like the idea of PHP-only l10n files, with multiple files per language, to allow for incremental changes to specific keys
[21:42:55] <logmsgbot>	 !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207273|Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps (T372165)]], [[gerrit:1207865|Undeploy the WikimediaEditorTasks extension (T376954)]] (duration: 39m 55s)
[21:43:01] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:43:02] <stashbot>	 T376954: Stop using and then undeploy the WikimediaEditorTasks extension - https://phabricator.wikimedia.org/T376954
[21:43:31] <bvibber>	 dancy: ok you're gonna laugh but i think i can make php array source work as an rsync-friendly format by manipulating whitespace or comments
[21:44:06] <bvibber>	 actually can i just append things and they'll overwrite?
[21:44:29] <bvibber>	 Reedy: were you needing to deploy something?
[21:44:53] <bvibber>	 if not i'll do 1208003
[21:44:56] <Reedy>	 Yeah, I was going to deploy something to fix an issue with Special:AccountRecovery for temp accounts
[21:45:02] <dancy>	 bvibber: Any change to a file will result in the new copy of the file being in the image in full,  shadowing the old file.
[21:45:06] <Reedy>	 You can stick it out at the same time as something else though too
[21:45:20] <bvibber>	 spiffy. you might throwing my 1208003 in with your wash? :)
[21:45:23] <bvibber>	 *mind
[21:46:06] <dancy>	 bvibber: I encourage hacking and I'm happy to test ideas in my train-dev environment.
[21:46:07] <bvibber>	 dancy: ah in that case i'll need a file per version. that has performance implications but is manageable
[21:46:13] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Fix wgMediaViewerThumbnailBucketSizes on prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208003 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[21:46:14] <bvibber>	 cool!
[21:46:24] <bd808>	 Docker layer diffs being file scoped rather than line scoped is a thing that I keep forgetting and remembering again
[21:46:56] <dancy>	 bvibber: Agreed.  It's a matter of tradeoffs but there's probably a reasonable threshold where it can still pay off.
[21:47:02] <wikibugs>	 (03Merged) 10jenkins-bot: Fix wgMediaViewerThumbnailBucketSizes on prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208003 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[21:47:04] <bvibber>	 yeah definitely
[21:47:39] <James_F>	 Maybe we should make a Phab task for this idea as a Hackathon project, so we can CC historically-involved people like K.rinkle and _.joe_ in case they're interested?
[21:48:00] <James_F>	 https://phabricator.wikimedia.org/project/view/8319/
[21:48:40] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1208003|Fix wgMediaViewerThumbnailBucketSizes on prod (T372165)]], [[gerrit:1208005|AccountRecovery: Allow temp users to access Special:AccountRecovery]]
[21:48:44] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:50:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: base: Add starship in trixie and beyond [puppet] - 10https://gerrit.wikimedia.org/r/1208012
[21:50:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: base: Switch away from legacy fact, lint ignore $::realm [puppet] - 10https://gerrit.wikimedia.org/r/1208013
[21:51:02] <bvibber>	 James_F: let's do it! you wanna open a task or shall i?
[21:51:24] <James_F>	 bvibber: You go for it. I'll CC in.
[21:51:28] <bvibber>	 sweet
[21:51:34] <Reedy>	 bvibber: Do you need/want to test yours when it's ready? Or don't really care?
[21:51:43] <James_F>	 You have more social capital, after all. :-)
[21:51:46] <bvibber>	 Reedy: naw it's functionally identical to the previous behavior
[21:51:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:52:02] <bvibber>	 just ... no longer setting a setting that doesn't work in most cases ;)
[21:52:40] <bvibber>	 James_F: the best thing about being in the bug-oisie for so long is having all this social capital
[21:53:22] <bd808>	 T99740 is a thing to read too when thinking about CDB -> PHP shifts
[21:53:22] <stashbot>	 T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740
[21:54:06] <logmsgbot>	 !log reedy@deploy2002 bvibber, reedy: Backport for [[gerrit:1208003|Fix wgMediaViewerThumbnailBucketSizes on prod (T372165)]], [[gerrit:1208005|AccountRecovery: Allow temp users to access Special:AccountRecovery]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:54:12] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:54:28] <bvibber>	 woo
[21:54:37] <Reedy>	 I note that irc ping is happening before a short hang to let you continue on the console
[21:54:39] <logmsgbot>	 !log reedy@deploy2002 bvibber, reedy: Continuing with sync
[21:54:56] <dancy>	 Reedy: That was a requested feature. :-)
[21:55:10] <Reedy>	 "GET READY TO TEST"
[21:55:12] * taavi fears it was them who requested it
[21:55:43] <Reedy>	 https://imgflip.com/i/acr2d0
[21:55:52] <wikibugs>	 (03PS1) 10Jforrester: tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954)
[21:55:57] <dancy>	 haha
[21:56:24] <Reedy>	 If we did everything on slack, we could have so many more images inline in the process
[21:56:44] <dancy>	 https://phabricator.wikimedia.org/T378740
[21:56:46] <Reedy>	 and emojis
[21:56:51] <taavi>	 Reedy: no
[21:56:52] <dancy>	 Discuss there. :-)
[21:56:54] <bd808>	 Reedy: this is the real reason for spiderpig
[21:59:55] <maryum>	 hi! I would like to do a security deploy. is something currently going on?
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2200)
[22:00:24] <bvibber>	 James_F: https://phabricator.wikimedia.org/T410694 for starters
[22:00:46] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208003|Fix wgMediaViewerThumbnailBucketSizes on prod (T372165)]], [[gerrit:1208005|AccountRecovery: Allow temp users to access Special:AccountRecovery]] (duration: 12m 06s)
[22:00:51] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[22:01:08] <bd808>	 maryum: ^ check with Reedy 
[22:01:49] <Reedy>	 My (and bvibber's) deploy is done
[22:02:02] <bvibber>	 whee thanks Reedy 
[22:02:24] <Reedy>	 Need to see if the web team show up to use the window
[22:03:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:04:04] <sbassett>	 99.9% of the time they do not (web team)
[22:04:26] <jinxer-wm>	 FIRING: InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1084 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[22:04:46] <swfrench-wmf>	 o/
[22:04:55] <cwhite>	 o/
[22:04:58] <swfrench-wmf>	 !incidents
[22:04:58] <sirenbot>	 7041 (UNACKED)  InboundMXQueueHigh sre (mx-in1001:9154 eqiad)
[22:04:58] <sirenbot>	 7036 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[22:05:03] <swfrench-wmf>	 !ack 7041
[22:05:03] <sirenbot>	 7041 (ACKED)  InboundMXQueueHigh sre (mx-in1001:9154 eqiad)
[22:05:50] <swfrench-wmf>	 cwhite: I seem to recall this happening recently
[22:06:20] <sbassett>	 SRE folks: should we hold off on the couple of security deploys while you look into the above issue?
[22:06:50] <James_F>	 bvibber: Tsk, localisation cache is the canonical spelling in MW-land! ;-)
[22:07:03] <swfrench-wmf>	 sbassett: I think you should be good to go, cwhite - any concerns?
[22:07:26] <cwhite>	 I agree - probably ok to continue with deploys
[22:07:44] * swfrench-wmf thumbs up
[22:07:51] <James_F>	 Yes, if they've not shown up by now, go for it.
[22:08:23] <maryum>	 James_F I'm about to start my deploy in a min
[22:08:34] <James_F>	 maryum: +1
[22:08:57] <bvibber>	 lol
[22:09:26] <jinxer-wm>	 FIRING: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1084 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[22:09:36] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[22:09:49] <swfrench-wmf>	 !incidents
[22:09:50] <sirenbot>	 7041 (ACKED)  InboundMXQueueHigh sre (mx-in1001:9154 eqiad)
[22:09:50] <sirenbot>	 7036 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[22:12:54] <wikibugs>	 (03CR) 10Andrea Denisse: "Hi! Could you please add tests for the alerts?" [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb)
[22:12:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:14:26] <jinxer-wm>	 RESOLVED: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1084 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[22:15:13] <wikibugs>	 (03CR) 10Jforrester: "The actual dropping of these tables, T410692, should happen first!" [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[22:15:24] <bvibber>	 end conflict in phab
[22:15:29] <bvibber>	 *edit conflict in phab, just like old times
[22:16:28] <bd808>	 bvibber: yeah, sorry. I think I got your changes back?
[22:16:36] <bvibber>	 yep thx :D
[22:16:39] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt)
[22:16:50] <bd808>	 phab's complete lack of conflict detection is annoying
[22:17:10] <maryum>	 scap currently running
[22:18:14] <wikibugs>	 (03CR) 10Cwhite: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight)
[22:19:27] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:22:57] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:24:10] <wikibugs>	 (03PS1) 10Bking: opensearch: Fix typo in repo GPG key filename. [puppet] - 10https://gerrit.wikimedia.org/r/1208020 (https://phabricator.wikimedia.org/T407123)
[22:24:46] <wikibugs>	 (03PS1) 10SBassett: ActionApi: Remove the xslt option [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208021 (https://phabricator.wikimedia.org/T401987)
[22:24:48] <logmsgbot>	 !log mstyles Deployed security patch for T407157
[22:25:05] <maryum>	 scap finished!
[22:25:42] <maryum>	 sbassett all yours now
[22:27:06] <sbassett>	 tx
[22:29:16] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:31:35] <wikibugs>	 (03PS1) 10MusikAnimal: ChangesListHooks: show entity titles in recent changes and watchlists [extensions/CommunityRequests] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208022 (https://phabricator.wikimedia.org/T406957)
[22:35:44] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] opensearch: Fix typo in repo GPG key filename. [puppet] - 10https://gerrit.wikimedia.org/r/1208020 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking)
[22:35:56] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch: Fix typo in repo GPG key filename. [puppet] - 10https://gerrit.wikimedia.org/r/1208020 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking)
[22:36:00] <icinga-wm>	 PROBLEM - Host cirrussearch2061 is DOWN: PING CRITICAL - Packet loss = 100%
[22:37:28] <icinga-wm>	 RECOVERY - Host cirrussearch2061 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[22:38:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208021 (https://phabricator.wikimedia.org/T401987) (owner: 10SBassett)
[22:39:00] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Add redis-tools to bastions [puppet] - 10https://gerrit.wikimedia.org/r/1208023 (https://phabricator.wikimedia.org/T410102)
[22:39:45] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge[1008-1010].eqiad.wmnet with reason: T410681
[22:39:50] <stashbot>	 T410681: Setup opensearch 3 on relforge servers - https://phabricator.wikimedia.org/T410681
[22:42:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS trixie
[22:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:42:36] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2061 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:43:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS trixie
[22:43:09] <wikibugs>	 (03Merged) 10jenkins-bot: ActionApi: Remove the xslt option [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208021 (https://phabricator.wikimedia.org/T401987) (owner: 10SBassett)
[22:43:28] <logmsgbot>	 !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1208021|ActionApi: Remove the xslt option (T401987 T401995)]]
[22:43:33] <stashbot>	 T401987: Consider deprecating/removing the xslt option from the action api - https://phabricator.wikimedia.org/T401987
[22:43:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS trixie
[22:45:02] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[22:45:07] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[22:46:11] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1010.eqiad.wmnet with OS trixie
[22:46:15] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1009.eqiad.wmnet with OS trixie
[22:46:23] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1008.eqiad.wmnet with OS trixie
[22:47:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS bookworm
[22:48:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm
[22:48:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm
[22:49:39] <inflatador>	 !log bking@apt1002 reprepro --component thirdparty/opensearch3 update bookworm-wikimedia
[22:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:21] <wikibugs>	 (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/output/1208023/7668/" [puppet] - 10https://gerrit.wikimedia.org/r/1208023 (https://phabricator.wikimedia.org/T410102) (owner: 10BryanDavis)
[22:50:37] <inflatador>	 !log bking@apt1002 reprepro --component thirdparty/opensearch2 update bookworm-wikimedia
[22:50:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:37] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2061 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:54:16] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:54:17] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:55:24] <logmsgbot>	 bking@cumin2002 reimage (PID 1383021) is awaiting input
[22:55:58] <logmsgbot>	 bking@cumin2002 reimage (PID 1383273) is awaiting input
[22:57:22] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS bookworm
[22:57:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:57:30] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS bookworm
[22:58:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm
[22:59:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage
[23:04:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage
[23:05:32] <logmsgbot>	 bking@cumin2002 reimage (PID 1388071) is awaiting input
[23:07:02] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS bookworm
[23:07:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm
[23:09:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm
[23:10:14] <swfrench-wmf>	 !log restarted postfix on mx-in1001, mx-in2001 at ~ 23:00 UTC for config change
[23:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:56] <wikibugs>	 (03CR) 10WMDE-Fisch: [V:03+1] admin: remove wmde-fisch from releasers-wikidiff2 [puppet] - 10https://gerrit.wikimedia.org/r/1207967 (owner: 10Dzahn)
[23:14:16] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[23:17:04] <logmsgbot>	 bking@cumin2002 reimage (PID 1394483) is awaiting input
[23:19:17] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:19:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1010.eqiad.wmnet with OS bookworm
[23:19:37] <logmsgbot>	 !log sbassett@deploy2002 sbassett: Backport for [[gerrit:1208021|ActionApi: Remove the xslt option (T401987 T401995)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:19:43] <stashbot>	 T401987: Consider deprecating/removing the xslt option from the action api - https://phabricator.wikimedia.org/T401987
[23:20:00] <logmsgbot>	 !log sbassett@deploy2002 sbassett: Continuing with sync
[23:24:17] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:24:17] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[23:25:13] <tappof>	 !log /srv/thanos-store cleanup on titan2001 (end)
[23:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:41] <wikibugs>	 (03CR) 10Bking: [C:03+1] "Thanks for reaching out on this one!" [puppet] - 10https://gerrit.wikimedia.org/r/1203548 (owner: 10CDanis)
[23:32:46] <logmsgbot>	 !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208021|ActionApi: Remove the xslt option (T401987 T401995)]] (duration: 49m 18s)
[23:32:51] <stashbot>	 T401987: Consider deprecating/removing the xslt option from the action api - https://phabricator.wikimedia.org/T401987
[23:34:28] <bd808>	 jouncebot: nowandnext
[23:34:28] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 25 minute(s)
[23:34:28] <jouncebot>	 In 7 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251121T0700)
[23:35:22] <bd808>	 sbassett: If you are done I would like to push out a couple of wikitech config changes. No worries if you are still working on stuff.
[23:35:56] <wikibugs>	 (03PS1) 10Scott French: mw-*: clean up 8.3 migration rollingUpdate and timeout tweaks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955)
[23:38:32] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS bookworm
[23:39:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm
[23:39:39] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1008.eqiad.wmnet with OS bookworm
[23:40:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm
[23:41:05] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11394566 (10RobH) Clarification Questions and statements:  * Most hosts previously had both 1G and 10G just these new config E are 10G only so they'll have to be in 10G capable ra...
[23:41:32] <wikibugs>	 (03PS1) 10Scott French: deployment_server: switch mw-debug/pinkunicorn to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955)
[23:45:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208022 (https://phabricator.wikimedia.org/T406957) (owner: 10MusikAnimal)
[23:46:14] <maryum>	 bd808 pretty sure sbassett is done
[23:46:26] <bd808>	 thx maryum 
[23:46:30] <logmsgbot>	 bking@cumin2002 reimage (PID 1411105) is awaiting input
[23:47:25] <logmsgbot>	 bking@cumin2002 reimage (PID 1411603) is awaiting input
[23:47:29] <bd808>	 heh. musikanimal jumped the queue before I got there
[23:47:44] <musikanimal>	 sorry! lol
[23:48:42] <bd808>	 yours at least doesn't look like it will cause a third full l10n rebuild :)
[23:48:46] <musikanimal>	 there are no localization changes so I think this one shouldn't take very long
[23:48:50] <musikanimal>	 yeah lol!
[23:52:53] <wikibugs>	 (03Merged) 10jenkins-bot: ChangesListHooks: show entity titles in recent changes and watchlists [extensions/CommunityRequests] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208022 (https://phabricator.wikimedia.org/T406957) (owner: 10MusikAnimal)
[23:53:14] <logmsgbot>	 !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1208022|ChangesListHooks: show entity titles in recent changes and watchlists (T406957)]]
[23:53:18] <stashbot>	 T406957: Show wish titles on lists (like on Wikidata) - https://phabricator.wikimedia.org/T406957
[23:57:39] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1208022|ChangesListHooks: show entity titles in recent changes and watchlists (T406957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:58:13] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Continuing with sync
[23:59:16] <jinxer-wm>	 FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage