[00:02:12] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1207293
[00:02:15] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294
[00:02:19] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207295
[00:08:23] (PS1) Dzahn: switch historic Subversion URLs from Phabricator to static-codereview [puppet] - https://gerrit.wikimedia.org/r/1207296
[00:09:48] (PS2) Dzahn: switch historic Subversion URLs from Phabricator to static-codereview [puppet] - https://gerrit.wikimedia.org/r/1207296
[00:11:01] (CR) Dzahn: [C:+1] "yea, whois has our name servers but needs NS in DNS" [dns] - https://gerrit.wikimedia.org/r/1207293 (owner: Ncmonitor)
[00:12:26] (CR) Scott French: [C:+1] "Thanks as always for the docs links!" [deployment-charts] - https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510) (owner: RLazarus)
[00:12:50] (CR) Scott French: [C:+1] kubernetes: Set default Envoy version to 1.32.12 [puppet] - https://gerrit.wikimedia.org/r/1207289 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus)
[00:12:56] (CR) Dzahn: [C:-1] "seems like another candidate for the "dont-pay-for-wikipedia-articles"" [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[00:13:41] (CR) Scott French: [C:+1] "Thanks for cleaning this up!" [deployment-charts] - https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:14:37] (CR) Dzahn: [C:+1] "yea, whois has our NS" [puppet] - https://gerrit.wikimedia.org/r/1207295 (owner: Ncmonitor)
[00:16:38] (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[00:19:47] (PS1) Dzahn: admin: transfer group approver for releasers-mediawiki to Mateus Santos [puppet] - https://gerrit.wikimedia.org/r/1207304
[00:20:33] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:20] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1207305
[00:23:27] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:10] (PS1) Dzahn: admin: deprecate the releasers-blubber group [puppet] - https://gerrit.wikimedia.org/r/1207313
[00:33:45] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1207320
[00:36:06] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390595 (Scott_French) @RobH - Thanks for checking! I'll also be out 12-01. I see you mentioned 11-21, but that's Friday. Did you mean Monday 11-24? If so, that (11-24) sou...
[00:38:28] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390608 (RobH) I totally messed up the dates on your comment: 2025-11-20, 2025-11-24, 2025-11-25 2025-12-03, 2025-12-04 So yeah, we can plan for the 24th (monday) no problem!
[00:38:37] (CR) Dzahn: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[00:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:06] (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1207293 (owner: Ncmonitor)
[00:40:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1207321
[00:40:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1207321 (owner: TrainBranchBot)
[00:40:24] !log brett@dns1006 START - running authdns-update
[00:40:44] (CR) RLazarus: [C:+2] kartotherian, tegola-vector-tiles: Remove unused tcp_health_check [deployment-charts] - https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:41:24] !log brett@dns1006 END - running authdns-update
[00:42:35] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11390612 (Ladsgroup) If you want to, I'll be around Thursday and Friday of this week and I can depool them for you. I can also do the 10G switch too (but...
[00:43:00] (Merged) jenkins-bot: kartotherian, tegola-vector-tiles: Remove unused tcp_health_check [deployment-charts] - https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:46:34] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390638 (Scott_French) Ack, Monday 2025-11-24 it is. Thank you!
[00:53:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1207321 (owner: TrainBranchBot)
[00:59:18] !log ladsgroup@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 3 days, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[00:59:36] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1185* gradually with 4 steps - Work done
[01:00:43] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:03:15] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[01:03:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T410589)', diff saved to https://phabricator.wikimedia.org/P85397 and previous config saved to /var/cache/conftool/dbconfig/20251120-010322-ladsgroup.json
[01:03:27] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:10:19] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1207336
[01:10:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1207336 (owner: TrainBranchBot)
[01:14:16] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 32s)
[01:14:35] PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:14:49] PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:14:49] PROBLEM - MariaDB Replica Lag: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:32:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:58] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1207336 (owner: TrainBranchBot)
[01:39:52] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[01:40:26] !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[01:44:55] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:03] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185* gradually with 4 steps - Work done
[02:42:32] ops-eqiad, SRE, DC-Ops, observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11390787 (herron) a:herron→RobH >>! In T405946#11390399, @RobH wrote: > We don't want to move anything the day before a holiday or weekend, as it doesn't allow fo...
[02:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:44:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[03:54:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:08:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:10:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:13:55] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (cloudcontrol2010-dev), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:14:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T410589)', diff saved to https://phabricator.wikimedia.org/P85402 and previous config saved to /var/cache/conftool/dbconfig/20251120-051454-ladsgroup.json
[05:14:59] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:15:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:15:35] RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:49] RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:49] RECOVERY - MariaDB Replica Lag: s6 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:30:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P85403 and previous config saved to /var/cache/conftool/dbconfig/20251120-053002-ladsgroup.json
[05:33:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:39:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:45:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P85404 and previous config saved to /var/cache/conftool/dbconfig/20251120-054509-ladsgroup.json
[06:00:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T410589)', diff saved to https://phabricator.wikimedia.org/P85405 and previous config saved to /var/cache/conftool/dbconfig/20251120-060017-ladsgroup.json
[06:00:23] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[06:00:34] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[06:00:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T410589)', diff saved to https://phabricator.wikimedia.org/P85406 and previous config saved to /var/cache/conftool/dbconfig/20251120-060041-ladsgroup.json
[06:10:03] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:10:41] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11390886 (Marostegui) @RobH as @Ladsgroup mentions, pc* hosts can only be done one at the time. I am out half today and Friday as oncall compensation. If...
[06:12:52] (PS1) Marostegui: db2144: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207492 (https://phabricator.wikimedia.org/T410480)
[06:13:55] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:14:02] (CR) Marostegui: [C:+2] db2144: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207492 (https://phabricator.wikimedia.org/T410480) (owner: Marostegui)
[06:14:52] (CR) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:15:03] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:20:00] (PS1) Marostegui: db1151: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207493
[06:20:30] (CR) Marostegui: [C:+2] db1151: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1207493 (owner: Marostegui)
[06:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0700)
[07:00:05] marostegui, Amir1, and federico3: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0700).
[07:14:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[07:19:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[07:21:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms2 T410480', diff saved to https://phabricator.wikimedia.org/P85407 and previous config saved to /var/cache/conftool/dbconfig/20251120-072110-marostegui.json
[07:21:15] T410480: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480
[07:26:25] (PS1) Filippo Giunchedi: cloudcephosd: move row C hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207739 (https://phabricator.wikimedia.org/T399180)
[07:26:27] (PS1) Filippo Giunchedi: cloudcephosd: move row D hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207740 (https://phabricator.wikimedia.org/T399180)
[07:26:29] (PS1) Filippo Giunchedi: cloudcephosd: move rack E4 hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207741 (https://phabricator.wikimedia.org/T399180)
[07:26:31] (PS1) Filippo Giunchedi: cloudcephosd: move rack F4 hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207742 (https://phabricator.wikimedia.org/T399180)
[07:26:32] (PS1) Filippo Giunchedi: cloudcephosd: move codfw hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207743 (https://phabricator.wikimedia.org/T399180)
[07:26:43] (PS1) DCausse: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744
[07:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:35:39] (PS1) Filippo Giunchedi: install_server: remove unused raid10-4dev-trixie.cfg and reuse-raid10-8dev.cfg [puppet] - https://gerrit.wikimedia.org/r/1207749
[07:36:07] (CR) Filippo Giunchedi: "Also reuse-raid10-8dev according to the comments is specific to kafka-main" [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[07:44:39] (PS1) Marostegui: check_private_data_report: Add clouddb102[45] [puppet] - https://gerrit.wikimedia.org/r/1207757 (https://phabricator.wikimedia.org/T409557)
[07:45:09] (PS2) DCausse: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744
[07:45:09] (PS1) DCausse: cirrus: enable default_sort for completion on a set of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858)
[07:45:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (owner: DCausse)
[07:45:57] (CR) Marostegui: [C:+2] check_private_data_report: Add clouddb102[45] [puppet] - https://gerrit.wikimedia.org/r/1207757 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[07:49:47] (PS3) DCausse: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (https://phabricator.wikimedia.org/T404858)
[07:49:49] (PS2) DCausse: cirrus: enable default_sort for completion on a set of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858)
[07:53:13] SRE: Improve "reuse" feature for standard partman recipes - https://phabricator.wikimedia.org/T410601 (fgiunchedi) NEW
[07:57:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0800).
[08:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:11] o/
[08:00:15] I can deploy
[08:01:51] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[08:02:40] (PS3) Arnaudb: gerrit: add dry run rsync [cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833)
[08:02:43] (Merged) jenkins-bot: Revert "cirrus: start A/B test on completion with default_sort" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207744 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[08:04:16] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1207744|Revert "cirrus: start A/B test on completion with default_sort" (T404858)]]
[08:04:20] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:06:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:09:10] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1207744|Revert "cirrus: start A/B test on completion with default_sort" (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:11:31] dcausse: I'll do a backport when you're done
[08:11:38] ack
[08:12:58] !log dcausse@deploy2002 dcausse: Continuing with sync
[08:13:02] !log installing squid security updates
[08:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:10] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207744|Revert "cirrus: start A/B test on completion with default_sort" (T404858)]] (duration: 12m 54s)
[08:17:15] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:17:46] kostajh: I'm done
[08:19:48] dcausse: thanks
[08:20:25] I need ~20 minutes or so before I can start
[08:21:37] (PS1) Bartosz Wójtowicz: ml-services: Add CIDRs enabling pod-to-pod communication. [deployment-charts] - https://gerrit.wikimedia.org/r/1207785 (https://phabricator.wikimedia.org/T408538)
[08:21:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:22:29] (PS1) Muehlenhoff: Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1207786
[08:22:52] (PS2) Muehlenhoff: sre.hosts.decommission: Fix typo [cookbooks] - https://gerrit.wikimedia.org/r/1207122
[08:26:53] (CR) Muehlenhoff: "Per git history raid10-4dev-trixie.cfg was only recently added by Andrew for some test, adding him for comments" [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:28:03] (PS1) Kosta Harlan: hCaptcha: Log the risk score for null edits differently [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550)
[08:28:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550) (owner: Kosta Harlan)
[08:28:42] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550) (owner: Kosta Harlan)
[08:29:29] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1207304 (owner: Dzahn)
[08:32:37] (CR) Muehlenhoff: admin: deprecate the releasers-blubber group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1207313 (owner: Dzahn)
[08:32:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:35:47] (Merged) jenkins-bot: hCaptcha: Log the risk score for null edits differently [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207788 (https://phabricator.wikimedia.org/T410550) (owner: Kosta Harlan)
[08:36:22] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1207788|hCaptcha: Log the risk score for null edits differently (T410550)]]
[08:36:27] T410550: hCaptcha: log risk score of null edits with other action than `edit` - https://phabricator.wikimedia.org/T410550
[08:36:29] (CR) Muehlenhoff: garage: Productionize garage (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo)
[08:37:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:38:29] (CR) Filippo Giunchedi: "Indeed, IIRC that was added to debug https://phabricator.wikimedia.org/T407586 which is now fixed" [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:38:51] (CR) Muehlenhoff: [C:+1] "Looks good, few nits inline" [puppet] - https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo)
[08:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:39:47] (CR) Muehlenhoff: [C:+1] "Makes sense. The reuse-raid10-8dev.cfg was only used for the now decommed old nodes and is no longer needed." [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:40:09] (CR) Muehlenhoff: [C:+2] sre.hosts.decommission: Fix typo [cookbooks] - https://gerrit.wikimedia.org/r/1207122 (owner: Muehlenhoff)
[08:40:12] (CR) Filippo Giunchedi: [C:+2] install_server: remove unused raid10-4dev-trixie.cfg and reuse-raid10-8dev.cfg [puppet] - https://gerrit.wikimedia.org/r/1207749 (owner: Filippo Giunchedi)
[08:40:50] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1207788|hCaptcha: Log the risk score for null edits differently (T410550)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:41:03] (CR) Muehlenhoff: [C:+1] "LGTM" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1203871 (owner: Elukey)
[08:43:10] !log kharlan@deploy2002 kharlan: Continuing with sync
[08:45:29] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@f3216ec] (releasing): testing deploy to failover host
[08:45:59] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@f3216ec] (releasing): testing deploy to failover host (duration: 00m 30s)
[08:47:13] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207788|hCaptcha: Log the risk score for null edits differently (T410550)]] (duration: 10m 51s)
[08:47:18] T410550: hCaptcha: log risk score of null edits with other action than `edit` - https://phabricator.wikimedia.org/T410550
[08:51:41] (PS1) Gehel: Hive: alert when query rate is too high [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528)
[08:53:15] (CR) CI reject: [V:-1] Hive: alert when query rate is too high [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: Gehel)
[08:53:36] (CR) Gehel: "I'm not quite sure why the alert isn't generated in the test..." [alerts] - https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: Gehel)
[09:00:05] brennen and andre: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T0900).
[09:02:55] (03CR) 10Alexandros Kosiaris: [C:03+2] relforge: Clarify comment about cumin masters role [puppet] - 10https://gerrit.wikimedia.org/r/1207212 (owner: 10Alexandros Kosiaris) [09:06:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:11:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:24:18] (03PS1) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 [09:24:57] (03PS2) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [09:25:38] (03CR) 10Gehel: [C:04-1] "The tests are passing, but the check is done on absolute values, not irate()." 
[alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [09:30:03] (03PS2) 10Alexandros Kosiaris: toolhub: make extraFQDNs specific to codfw, eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/954290 [09:33:57] (03CR) 10Bartosz Wójtowicz: ml-services: add new namespace to prod (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski) [09:38:57] (03PS2) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 [09:41:37] (03CR) 10Bartosz Wójtowicz: ml-services: add new namespace to prod (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski) [09:41:37] (03PS3) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 [09:43:21] (03PS1) 10Dpogorzelski: ml-services: add revise-tone-task-generator [puppet] - 10https://gerrit.wikimedia.org/r/1207800 [09:44:16] that was quick [09:45:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:38] (03PS1) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) [09:56:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T410589)', diff saved to https://phabricator.wikimedia.org/P85408 and previous config saved to /var/cache/conftool/dbconfig/20251120-095606-ladsgroup.json [09:56:12] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [09:58:14] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect 
LLDP info - https://phabricator.wikimedia.org/T250367#11391188 (10ayounsi) One liners: `lang=python >>> spicerack.redfish('sretest2004').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridg... [09:59:03] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [09:59:50] (03PS7) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [10:00:39] (03PS2) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) [10:11:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P85409 and previous config saved to /var/cache/conftool/dbconfig/20251120-101114-ladsgroup.json [10:24:44] (03CR) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [10:26:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P85410 and previous config saved to /var/cache/conftool/dbconfig/20251120-102622-ladsgroup.json [10:33:25] (03PS2) 10Arnaudb: apt: add an alert on reprepro errors [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) [10:33:25] (03CR) 10Arnaudb: "this patch brings new alerts based on metrics introduced in 1205162 and 1206887" [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [10:37:26] (03CR) 10Daniel Kinzler: rest-gateway: assign 
ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [10:37:35] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:39:22] (03Merged) 10jenkins-bot: rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:41:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T410589)', diff saved to https://phabricator.wikimedia.org/P85411 and previous config saved to /var/cache/conftool/dbconfig/20251120-104129-ladsgroup.json [10:41:35] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:41:35] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [10:41:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T410589)', diff saved to https://phabricator.wikimedia.org/P85412 and previous config saved to /var/cache/conftool/dbconfig/20251120-104142-ladsgroup.json [10:43:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:46:44] hmm [10:47:05] Oh it's staging [10:47:31] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:47:52] !log cgoubert@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [10:49:47] arnaudb, effie, just a head's up I'm switching the rest-gateway's backends to dc-local, no expected impact but it will shift some traffic. I'll keep an eye on graphs. [10:49:58] cool [10:49:58] ack, thanks claime [10:50:37] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:50:56] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:51:31] (03PS3) 10Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) [10:52:20] (03PS4) 10Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) [10:53:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:59:06] (03PS1) 10DCausse: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) [10:59:18] (03PS1) 10DCausse: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1100) [11:00:09] (03PS1) 10David Caro: maintain_dbusers: parse the response before throwing [puppet] - 10https://gerrit.wikimedia.org/r/1207814 [11:02:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [11:02:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [11:09:13] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:12:03] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:30:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:35:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:35:56] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612 (10Blake) 03NEW [11:40:29] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - 
https://phabricator.wikimedia.org/T410612#11391425 (10Clement_Goubert) p:05Triage→03Medium @Kappakayala and @hnowlan being OOO, @mark could I get approval for this please? [11:40:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:40:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:44:54] (03CR) 10FNegri: [C:03+1] maintain_dbusers: parse the response before throwing [puppet] - 10https://gerrit.wikimedia.org/r/1207814 (owner: 10David Caro) [11:45:23] (03CR) 10FNegri: [C:03+1] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [11:45:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:45:51] (03CR) 10Majavah: [C:04-1] "This will lead to more unclear error messages when the error is not valid JSON, unfortunately." 
[puppet] - 10https://gerrit.wikimedia.org/r/1207814 (owner: 10David Caro) [11:46:14] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloud_private_subnet: Cleanup IPv6 conditions [puppet] - 10https://gerrit.wikimedia.org/r/1204627 (owner: 10Majavah) [11:48:10] (03CR) 10Clément Goubert: [C:03+1] mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French) [11:48:45] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11391457 (10Volans) [11:48:47] (03PS1) 10Blake: Add blake to ops, remove blake from ops-limited. [puppet] - 10https://gerrit.wikimedia.org/r/1207824 (https://phabricator.wikimedia.org/T410612) [11:50:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:51:17] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:53:55] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Looks good to me :) It handles ns creation on both staging and prod, correct?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1207800 (owner: 10Dpogorzelski) [11:55:33] (03PS6) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 [11:55:34] (03PS7) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [11:55:34] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 [11:56:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7656/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [11:57:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:58:19] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah) [11:59:09] (03CR) 10Cathal Mooney: "Nice! LGTM, I'd say let's test it a bit and get Luca's input when he's back before merging but code looks good to me and makes sense." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [12:01:36] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7655/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah) [12:02:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:02:24] (03PS2) 10Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 [12:02:24] (03PS8) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [12:02:36] jouncebot: nowandnext [12:02:36] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [12:02:36] In 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1300) [12:02:51] I’d like to roll out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 if nobody objects (config cleanup, mostly a no-op) [12:03:56] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7657/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [12:04:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [12:04:35] (03CR) 10CI reject: [V:04-1] 
P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah) [12:04:53] (03CR) 10CI reject: [V:04-1] P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [12:05:18] (03PS3) 10Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 [12:05:18] (03PS9) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [12:05:20] (03Merged) 10jenkins-bot: tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [12:05:54] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1207171|tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (T410507)]] [12:05:59] T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507 [12:06:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:07:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:08:01] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) [12:09:42] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7658/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah) [12:10:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1207171|tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (T410507)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:10:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [12:12:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:13:44] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [12:13:49] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [12:13:54] (03PS1) 10Bartosz Wójtowicz: ml-services: Enable Changeprop for revise-tone-task-generator staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) [12:14:51] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207171|tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (T410507)]] (duration: 08m 57s) [12:14:55] T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507 [12:16:20] * Lucas_WMDE done deploying [12:16:26] Lucas_WMDE: thanks again for deployment, i guess resetauthentication script, not needed since there's no IP address [12:16:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [12:16:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[12:16:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:17:54] anzx: yeah, it would give the same error as yesterday ^^ [12:18:01] I also updated the wikitech docs btw [12:19:14] thank you [12:21:08] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync [12:21:37] !log roll-restart of mobileapps codfw - T410296 [12:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:41] T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 - https://phabricator.wikimedia.org/T410296 [12:22:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:22:16] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync [12:26:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [12:26:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[12:26:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:27:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:27:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:28:07] (03PS1) 10JMeybohm: Implement fetching of ipblock-source urls [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1207834 [12:28:58] (03CR) 10JMeybohm: [V:03+2 C:03+2] Implement fetching of ipblock-source urls [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1207834 (owner: 10JMeybohm) [12:31:02] !log jayme@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin2002 - T402014" [12:31:04] !log jayme@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin2002 - T402014 [12:31:06] T402014: Add ipblock-source objects and logic - https://phabricator.wikimedia.org/T402014 [12:31:46] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [12:31:55] !log jayme@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin2002 
- T402014 [12:31:57] !log jayme@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin2002 - T402014" [12:32:43] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11391704 (10cmooney) >>! In T408892#11389081, @Papaul wrote: > I think a am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that... [12:32:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:32:45] (03CR) 10Sergio Gimeno: [C:03+1] [Growth] Enable Add Link task pool generation for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm) [12:37:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:39:35] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: SystemdUnitCrashLoop (instance grafana2001:9100) - https://phabricator.wikimedia.org/T410619 (10LSobanski) 03NEW [12:41:26] (03PS1) 10JMeybohm: P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1207844 
(https://phabricator.wikimedia.org/T402014) [12:42:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:44:34] (03CR) 10LWatson: [C:03+1] Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [12:45:38] (03PS1) 10STran: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) [12:51:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11391814 (10cmooney) [12:52:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:53:24] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:54:41] (03PS1) 10JMeybohm: fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014) [12:54:50] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler) [12:56:14] 06SRE, 10SRE-swift-storage, 10Infrastructure 
Security, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), and 4 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11391847 (10Clement_Goubert) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1300) [13:01:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:03:30] (03CR) 10Klausman: ml-services: add new namespace to prod (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski) [13:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:21:23] (03CR) 10Filippo Giunchedi: "LGTM, only docs to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [13:22:02] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, modulo I5891d5367 of course" [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah) [13:23:25] (03PS7) 10Majavah: interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 [13:23:25] (03PS4) 10Majavah: P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 [13:23:25] (03PS10) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [13:23:46] (03PS4) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 [13:25:07] (03CR) 10Majavah: interface::route: 
Support passing in a CIDR directly (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [13:25:07] (03PS5) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 [13:25:13] (03CR) 10Majavah: [C:03+2] interface::route: Support passing in a CIDR directly [puppet] - 10https://gerrit.wikimedia.org/r/1196368 (owner: 10Majavah) [13:26:24] (03CR) 10Majavah: [C:03+2] P:wmcs::cloud_private_subnet: Simplify route definitions [puppet] - 10https://gerrit.wikimedia.org/r/1207826 (owner: 10Majavah) [13:27:11] (03PS6) 10Dpogorzelski: ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 [13:28:23] (03PS3) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) [13:28:59] (03CR) 10Dpogorzelski: ml-services: add new namespace to prod (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski) [13:29:55] (03CR) 10Jcrespo: garage: Productionize garage (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [13:30:21] (03CR) 10Jcrespo: garage: Productionize garage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [13:30:31] (03CR) 10Dbrant: [C:03+2] mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French) [13:31:49] (03CR) 10Dpogorzelski: [C:03+2] ml-services: add revise-tone-task-generator [puppet] - 10https://gerrit.wikimedia.org/r/1207800 (owner: 10Dpogorzelski) [13:32:12] jouncebot: nowandnext [13:32:12] For the next 0 hour(s) and 27 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1300) [13:32:12] In 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400) [13:32:14] (03Merged) 10jenkins-bot: mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French) [13:33:16] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:33:33] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:33:56] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:34:43] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:34:52] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:35:34] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:42:41] (03PS1) 10Marostegui: data.yaml: Add FIDO key for marostegui [puppet] - 10https://gerrit.wikimedia.org/r/1207863 [13:45:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:39] (03PS1) 10Slyngshede: data.yaml: Offboarding roti [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) [13:47:22] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11392028 (10ayounsi) @papaul, could you have a look at the BIOS of sretest1005 ? The matching Redfish keys don't exist :( `lang=python >>> dump3.components['N... 
[13:50:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:51:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11392041 (10cmooney) [13:51:48] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) (owner: 10Slyngshede) [13:51:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [13:52:17] (03CR) 10Ssingh: [C:03+1] hcaptcha_proxy: remove unused parameters [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 (owner: 10Kamila Součková) [13:52:53] (03PS1) 10Jforrester: Undeploy the WikimediaEditorTasks extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) [13:54:27] (03CR) 10Kamila Součková: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 (owner: 10Kamila Součková) [13:54:45] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding roti [puppet] - 10https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) (owner: 10Slyngshede) [13:56:53] (03CR) 10Ssingh: [C:03+1] hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková) [13:56:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is 
consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [13:57:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:57:53] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski) [13:57:55] I won’t be around during the beginning of today’s backport window btw; I might be able to deploy partway through if needed [13:58:24] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:59:34] (03CR) 10Ssingh: [C:03+2] O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400). [14:00:04] Daimona, JSherman, and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] here [14:00:23] multiple changes pending [14:00:31] ack [14:00:33] slyngs: dpogorzelski: ok to merge yours? 
[14:00:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:00:44] o/ [14:00:51] Go ahead [14:00:56] thanks [14:01:17] dpogorzelski: merging yours as well [14:01:33] I ping -ml for that one. It's probably fine :-) [14:02:06] it's my fault if it wasn't supposed to be merged. usually we blame Fabrizio for anything but he is not around so it's me [14:02:55] (03CR) 10AikoChou: "liftwing_streams.yaml is mainly for testing rendering when we update charts, so adding it there is optional but a good practice :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [14:02:56] o/ [14:03:13] (03CR) 10Dbrant: [C:03+1] Undeploy the WikimediaEditorTasks extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester) [14:03:57] (03CR) 10Ssingh: [C:03+2] site.pp: reimage hcaptcha-proxy1001 to proper role [puppet] - 10https://gerrit.wikimedia.org/r/1207165 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [14:05:15] (03CR) 10Kamila Součková: [C:03+2] hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková) [14:05:47] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy1001.wikimedia.org with OS bookworm [14:05:47] I can try doing the backport [14:06:00] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11392104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[14:06:09] (03CR) 10Kamila Součková: [V:03+2 C:03+2] hcaptcha_proxy: remove unused parameters [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 (owner: 10Kamila Součková) [14:06:12] Daimona: prepare 🍿 [14:06:16] (03CR) 10Ladsgroup: [C:03+2] Enable $wgCampaignEventsEnableContributionTracking in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206964 (https://phabricator.wikimedia.org/T404904) (owner: 10Daimona Eaytoy) [14:06:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:06:45] ready 🤌 [14:07:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:07:04] (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableContributionTracking in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206964 (https://phabricator.wikimedia.org/T404904) (owner: 10Daimona Eaytoy) [14:07:13] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:09:21] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1206964|Enable $wgCampaignEventsEnableContributionTracking in production (T404904)]] [14:09:26] T404904: Release Collaborative Contributions MVP to all wikis with CampaignEvents extension - NOV 20 - https://phabricator.wikimedia.org/T404904 [14:14:57] !log ladsgroup@deploy2002 
daimona, ladsgroup: Backport for [[gerrit:1206964|Enable $wgCampaignEventsEnableContributionTracking in production (T404904)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:15:01] T404904: Release Collaborative Contributions MVP to all wikis with CampaignEvents extension - NOV 20 - https://phabricator.wikimedia.org/T404904 [14:15:29] Daimona: live in mwdebug [14:16:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:17:46] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy1001.wikimedia.org with reason: host reimage [14:21:12] Daimona: ping :P [14:21:17] RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:26] Yeah I was testing, it's hard to do it quickly with popcorns in one hand :P [14:21:31] lol [14:21:36] sorry, I didn't get ACK [14:21:42] I thought you didn't get it [14:22:02] RESOLVED: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:22:05] No sorry, I should've said so explicitly I just thought it'd be easier [14:22:16] Anyway, it seems to be mostly working [14:22:18] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:22:38] The exception being that my edit has not been recorded. Might be that the jobqueue is lagged or something, or perhaps an mwdebug-only thing [14:23:05] Or potentially some cache [14:23:10] I think we can go ahead and I'll test again later [14:23:25] (I don't see any exceptions in logstash either, that's why I'm kinda confident) [14:23:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy1001.wikimedia.org with reason: host reimage [14:24:11] !log ladsgroup@deploy2002 daimona, ladsgroup: Continuing with sync [14:27:31] (03PS3) 10Ayounsi: Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) [14:27:54] (03CR) 10Ayounsi: Interface validators: prevent more mistakes on interface naming (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: 10Ayounsi) [14:28:15] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206964|Enable $wgCampaignEventsEnableContributionTracking in production (T404904)]] (duration: 18m 53s) [14:28:19] T404904: Release Collaborative Contributions MVP to all wikis with CampaignEvents extension - NOV 20 - https://phabricator.wikimedia.org/T404904 [14:31:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 
2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11392187 (10cmooney) [14:32:01] (03CR) 10Ladsgroup: [C:03+2] Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [14:32:06] (03CR) 10Ladsgroup: [C:03+2] Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [14:33:20] (03Merged) 10jenkins-bot: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207813 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [14:34:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [14:34:20] (03Merged) 10jenkins-bot: Fix filtering of relevant default sort suggestions [extensions/CirrusSearch] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207812 (https://phabricator.wikimedia.org/T410602) (owner: 10DCausse) [14:34:53] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207813|Fix filtering of relevant default sort suggestions (T410602)]], [[gerrit:1207812|Fix filtering of relevant default sort suggestions (T410602)]] [14:34:55] FIRING: [3x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:58] T410602: CirrusSearch metadata stores DEFAULTSORT overrides 
even after they've been removed - https://phabricator.wikimedia.org/T410602 [14:35:20] (03CR) 10Ladsgroup: [C:04-1] "Please enable new features gradually. First testwikis and small wikis, then bigger number of wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [14:36:20] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Robert Timm out of all services on: 2413 hosts [14:36:41] Amir1: not all of these wikis have the extension enabled; this is preparatory work that impacts a community config form [14:37:15] Amir1: we have run comms with impacted wikis already [14:37:17] still better to do it small wikis first [14:37:23] (03CR) 10Dpogorzelski: [C:03+2] ml-services: add new namespace to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207798 (owner: 10Dpogorzelski) [14:37:25] it's not about comms [14:37:31] it's about bugs and issues [14:37:58] please share which wikis you are concerned about [14:38:53] any wiki listed in https://noc.wikimedia.org/conf/highlight.php?file=dblists/large.dblist [14:40:14] !log ladsgroup@deploy2002 ladsgroup, dcausse: Backport for [[gerrit:1207813|Fix filtering of relevant default sort suggestions (T410602)]], [[gerrit:1207812|Fix filtering of relevant default sort suggestions (T410602)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:40:18] T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602 [14:40:29] dcausse: it's live in mwdebug, can you test? [14:40:35] Amir1: yes, testing [14:40:41] Awesome! [14:41:01] Amir1: all good! 
[14:41:20] !log ladsgroup@deploy2002 ladsgroup, dcausse: Continuing with sync [14:41:26] moving forward \o/ [14:42:03] Amir1: I am extremely dissatisfied with this outcome; you are not a listed deployer for this window, and I have the same deploy privileges as you. I could have just self-deployed this [14:42:19] This is a low impact, low risk change [14:43:11] currently enabled wikis are: [14:43:11] 'testwiki' [14:43:12] 'trwiki' [14:43:12] 'idwiki' [14:43:12] 'ukwiki' [14:43:12] 'viwiki' [14:43:12] 'afwiki' [14:43:13] 'bnwiki' [14:43:13] 'azwiki' [14:43:14] 'zhwiki' [14:43:14] 'eswiki' [14:43:20] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400 [14:43:27] > Your patch may or may not be deployed at the sole discretion of the deployer [14:43:52] Sure, but you just jumped in and named yourself deployer [14:43:54] if you have anyone else from the window willing to deploy this, go ahead [14:44:01] I am willing [14:44:54] deploying [14:44:55] (03PS1) 10Ssingh: Revert "hcaptcha_proxy: remove unused parameters" [puppet] - 10https://gerrit.wikimedia.org/r/1207873 [14:45:02] > Deployer: Lucas (Lucas_WMDE), Martin (Urbanecm), Sammy (TheresNoTime) [14:45:19] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207813|Fix filtering of relevant default sort suggestions (T410602)]], [[gerrit:1207812|Fix filtering of relevant default sort suggestions (T410602)]] (duration: 10m 25s) [14:45:19] if any of them are happy with deploying, I have no objection [14:45:23] T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602 [14:45:40] Amir1: thanks for the deploy! [14:45:45] no worries! [14:45:46] [FTR: I tested my change now that it's live and things are working as expected. Thanks Amir1!] [14:45:52] Awesome! [14:46:38] Amir1: how is it that you can say no, but only they can say yes?
[14:47:01] (03PS1) 10Slyngshede: C:tomcat10 hide stacktrace and server info [puppet] - 10https://gerrit.wikimedia.org/r/1207874 [14:47:04] no, it's seeking a third party opinion [14:47:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11392255 (10bking) a:05bking→03None [14:47:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11392256 (10bking) a:05bking→03None [14:48:47] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers hcaptcha1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:48:53] (03CR) 10Kamila Součková: [C:03+1] Revert "hcaptcha_proxy: remove unused parameters" [puppet] - 10https://gerrit.wikimedia.org/r/1207873 (owner: 10Ssingh) [14:49:03] (03CR) 10Ssingh: [C:03+2] Revert "hcaptcha_proxy: remove unused parameters" [puppet] - 10https://gerrit.wikimedia.org/r/1207873 (owner: 10Ssingh) [14:49:22] !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php enwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize # T410602 reindexing search suggestions on enwiki [14:49:24] !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php enwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize # T410602 reindexing search suggestions on enwiki [14:49:43] Amir1: I feel extremely surprised and frustrated with this situation. I have never had an unlisted deployer come in and hold up a deployment at the last minute like this before. But, okay. I'll go take a breath. 
I do appreciate that you are doing what you think is best. [14:49:55] FIRING: [4x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:03] <_joe_> JSherman: frankly, your attitude is unacceptable. [14:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:51:35] (03PS1) 10Cathal Mooney: lvs1020: move row C vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207877 (https://phabricator.wikimedia.org/T405609) [14:51:43] <_joe_> if a fellow developer, and in this case someone who is actively getting paged for emergencies, expresses doubts about a deployment, it's not ok to try to bulldoze through it. [14:51:43] I am not unlisted deployer. I'm in the morning list (https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_November_20) and have been deploying for around eight years now. 
I've taken over since Lucas explicitly said that he can't do it today [14:51:47] Amir1: I apologize for my response to this [14:52:01] <_joe_> Amir1: please avoid further interactions, what needed to be said was said [14:52:11] yeah, fair [14:52:28] <_joe_> JSherman: ok, all good then <4 [14:52:32] <_joe_> err *<3 [14:53:37] !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php frwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize # T410602 reindexing search suggestions on frwiki [14:53:42] T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602 [14:53:48] I'll go touch grass, I clearly did not handle this well [14:54:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T410589)', diff saved to https://phabricator.wikimedia.org/P85415 and previous config saved to /var/cache/conftool/dbconfig/20251120-145439-ladsgroup.json [14:54:45] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:54:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:54:55] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy1001.wikimedia.org with OS bookworm [14:55:36] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11392286
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte... [14:56:58] (03PS1) 10Ladsgroup: Revert^2 "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 [14:57:03] jouncebot: nowandnext [14:57:04] For the next 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1400) [14:57:04] In 0 hour(s) and 32 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1530) [15:00:19] (03PS1) 10Blake: Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) [15:00:48] (03CR) 10CI reject: [V:04-1] Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [15:03:36] (03PS2) 10Blake: Add a script and systemd unit to monitor for keystore updates. [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) [15:04:04] (03CR) 10CI reject: [V:04-1] Add a script and systemd unit to monitor for keystore updates. 
[puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [15:04:07] (03CR) 10Ladsgroup: [C:03+2] Revert^2 "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup) [15:04:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup) [15:05:55] (03PS4) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) [15:06:59] (03CR) 10Tiziano Fogli: "Just tested on Pontoon @cwhite@wikimedia.org, sorry for the lag, it took me a while ... I’ve added an inline comment... and thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: 10Cwhite) [15:08:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P85416 and previous config saved to /var/cache/conftool/dbconfig/20251120-150946-ladsgroup.json [15:12:00] (03PS1) 10Sergio Gimeno: EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) [15:13:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:14:17] !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php hewiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize # T410602 reindexing search suggestions on hewiki [15:14:21] T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602 [15:14:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:18:29] (03Merged) 10jenkins-bot: Revert^2 "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup) [15:18:46] jouncebot: next [15:18:46] In 0 hour(s) and 11 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1530) [15:19:01] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11392432 (10cmooney) [15:19:01] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207878|Revert^2 "rdbms: Dismantle concept of groups"]] [15:19:21] !log ladsgroup@deploy2002 sync-world failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.RV8eoygq6j']' returned [15:19:21] non-zero exit status 255. 
(scap version: 4.227.0) (duration: 00m 20s) [15:19:47] (03PS1) 10TrainBranchBot: Revert "Revert^2 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207884 [15:19:47] (03CR) 10TrainBranchBot: "ladsgroup@deploy2002 created a revert of this change as I12ca322e5f483632714b196ee67c6849ab2bc9d6" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207878 (owner: 10Ladsgroup) [15:19:55] :/ [15:20:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207884 (owner: 10TrainBranchBot) [15:22:46] It's okay 😁 [15:23:28] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11392495 (10MatthewVernon) [15:24:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P85417 and previous config saved to /var/cache/conftool/dbconfig/20251120-152454-ladsgroup.json [15:25:20] (03PS2) 10JMeybohm: fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014) [15:26:03] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [15:26:31] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1530) [15:30:39] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:31:26] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:32:35] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:33:21] (03PS5) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) [15:33:21] (03PS1) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [15:33:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:12] (03CR) 10CI reject: [V:04-1] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:34:20] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:34:31] (03Merged) 10jenkins-bot: Revert "Revert^2 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207884 (owner: 10TrainBranchBot) [15:34:40] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11392629 (10MatthewVernon) [15:34:47] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:35:06] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207884|Revert "Revert^2 "rdbms: Dismantle concept of groups""]] [15:35:11] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [15:35:43] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[15:36:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11392634 (10cmooney) [15:36:43] (03PS2) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [15:37:19] (03CR) 10CI reject: [V:04-1] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:38:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:39:01] (03PS1) 10Dpogorzelski: ml-services: fix ns creation parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207888 [15:39:09] (03CR) 10Dpogorzelski: [C:03+2] ml-services: fix ns creation parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207888 (owner: 10Dpogorzelski) [15:39:18] !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php frwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize # T410602 reindexing search suggestions on frwiki [15:39:23] T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602 [15:39:24] !log ladsgroup@deploy2002 ladsgroup, trainbranchbot: Backport for [[gerrit:1207884|Revert "Revert^2 "rdbms: Dismantle concept of groups""]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:40:02] !log ladsgroup@deploy2002 ladsgroup, trainbranchbot: Continuing with sync [15:40:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T410589)', diff saved to https://phabricator.wikimedia.org/P85418 and previous config saved to /var/cache/conftool/dbconfig/20251120-154002-ladsgroup.json [15:40:07] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:40:08] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:40:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T410589)', diff saved to https://phabricator.wikimedia.org/P85419 and previous config saved to /var/cache/conftool/dbconfig/20251120-154014-ladsgroup.json [15:40:58] (03Merged) 10jenkins-bot: ml-services: fix ns creation parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207888 (owner: 10Dpogorzelski) [15:43:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:44:17] (03PS3) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [15:44:55] (03CR) 10CI reject: [V:04-1] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:45:00] (03PS1) 10Cathal Mooney: lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628) [15:45:16] !log ladsgroup@deploy2002 Finished scap sync-world: 
Backport for [[gerrit:1207884|Revert "Revert^2 "rdbms: Dismantle concept of groups""]] (duration: 10m 09s) [15:46:05] (03CR) 10Michael Große: [C:03+1] "Makes sense to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [15:46:14] (03PS4) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [15:46:22] Amir1: I want to be a little more specific in my apology from earlier; I let my frustration get the best of me and I did not respond to you with humility and an open mind. I was not following the golden rule! I was holding on to the idea that I was right and you were wrong. Sitting in the deployer seat means saying no to things sometimes. I can't go back in time and respond better, but I can do better next time. [15:46:38] (03PS2) 10Cathal Mooney: lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628) [15:47:28] (03CR) 10Clare Ming: [C:03+1] EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [15:48:03] (03PS1) 10Bking: opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361) [15:49:03] (03PS2) 10Bking: opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361) [15:50:07] (03CR) 10Scott French: [C:03+1] cache::text: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [15:52:39] !log dpogorzelski@deploy2002 helmfile 
[ml-staging-codfw] START helmfile.d/admin 'sync'. [15:53:20] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:55:20] (03CR) 10Gehel: [C:03+1] opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361) (owner: 10Bking) [15:55:37] (03PS1) 10Cwhite: loki: remove loki rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) [15:55:43] !log mforns@deploy2002 Started deploy [analytics/refinery@4df475f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4df475f3] [15:55:47] (03CR) 10Bking: [C:03+2] opensearch on k8s: stop hard-coding JVM memory settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207893 (https://phabricator.wikimedia.org/T405361) (owner: 10Bking) [15:56:45] !log mforns@deploy2002 Finished deploy [analytics/refinery@4df475f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4df475f3] (duration: 01m 01s) [15:57:01] !log mforns@deploy2002 Started deploy [analytics/refinery@4df475f]: Regular analytics weekly train [analytics/refinery@4df475f3] [15:57:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[15:57:23] (03CR) 10CI reject: [V:04-1] loki: remove loki rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) (owner: 10Cwhite) [15:58:23] (03PS2) 10Cwhite: loki: remove loki rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) [15:58:49] (03CR) 10RLazarus: [C:03+2] kubernetes: Set default Envoy version to 1.32.12 [puppet] - 10https://gerrit.wikimedia.org/r/1207289 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [15:59:26] !log mforns@deploy2002 Finished deploy [analytics/refinery@4df475f]: Regular analytics weekly train [analytics/refinery@4df475f3] (duration: 02m 25s) [16:00:05] brennen and andre: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1600). [16:00:26] !log dcausse@deploy2002 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php hewiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize=3000 --cluster=eqiad --optimize # T410602 reindexing search suggestions on hewiki [16:00:30] T410602: CirrusSearch metadata stores DEFAULTSORT overrides even after they've been removed - https://phabricator.wikimedia.org/T410602 [16:01:17] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11392787 (10Jhancock.wm) logged in to idrac to check. so far so good. if it doesn't alert by monday, we should be able to close the ticket. 
[16:03:26] (03PS1) 10Ssingh: hiera: hcaptcha/proxy: fix healthchecks for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1207896 (https://phabricator.wikimedia.org/T409780) [16:03:59] !log mforns@deploy2002 Started deploy [analytics/refinery@4df475f] (thin): Regular analytics weekly train THIN [analytics/refinery@4df475f3] [16:05:16] !log mforns@deploy2002 Finished deploy [analytics/refinery@4df475f] (thin): Regular analytics weekly train THIN [analytics/refinery@4df475f3] (duration: 01m 16s) [16:07:23] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) (owner: 10Cwhite) [16:10:54] (03PS1) 10Ssingh: Revert "hcaptcha_proxy: remove unused parameters" [labs/private] - 10https://gerrit.wikimedia.org/r/1207897 [16:12:22] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [16:14:07] (03CR) 10Cwhite: [C:03+2] "PCC LGTM: https://puppet-compiler.wmflabs.org/output/1207894/5335/" [puppet] - 10https://gerrit.wikimedia.org/r/1207894 (https://phabricator.wikimedia.org/T410619) (owner: 10Cwhite) [16:16:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661 (10cmooney) 03NEW p:05Triage→03Medium [16:16:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661#11392925 (10cmooney) [16:16:19] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11392926 (10cmooney) [16:16:42] !log rebooting sretest1005 to check LLDP settings [16:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:09] PROBLEM - Host sretest1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:16] (03CR) 10Ahmon Dancy: 
[C:03+1] "This is ready to deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [16:18:18] (03CR) 10Ahmon Dancy: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [16:18:34] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11392932 (10MatthewVernon) [16:19:43] RECOVERY - Host sretest1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [16:22:45] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11392975 (10Papaul) @ayounsi sretest1005 is the same as 2004 see below. what you can maybe check is the redfish /IDRAC version on sretest2004 and 1005 {F703... [16:28:45] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11393052 (10MatthewVernon) [16:29:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11393076 (10RobH) I've set a gcal event for 2025-12-03 @ 10AM EST / 15:00 GMT for the alert1002 migration. [16:31:30] (03PS1) 10Ssingh: P:hcaptcha::proxy: do not restart nginx (do reload) [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) [16:32:03] 07sre-alert-triage, 10Observability-Logging, 06SRE Observability (FY2025/2026-Q2): Alert in need of triage: SystemdUnitCrashLoop (instance grafana2001:9100) - https://phabricator.wikimedia.org/T410619#11393094 (10colewhite) 05Open→03Resolved a:03colewhite I've removed the rsync job that I suspect w... 
[16:32:28] (03CR) 10Ssingh: [V:03+2 C:03+2] Revert "hcaptcha_proxy: remove unused parameters" [labs/private] - 10https://gerrit.wikimedia.org/r/1207897 (owner: 10Ssingh) [16:32:32] (03PS2) 10Stevemunene: LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) [16:33:46] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [16:33:58] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7661/console" [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:35:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1005.eqiad.wmnet [16:35:42] (03CR) 10Ssingh: "Yeah sorry Ben. I will take care of this today." [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [16:37:36] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T410667 (10hold_your_horses) 03NEW [16:37:57] 06SRE, 10LDAP-Access-Requests: Grant Access to GitLab for holdyourhorses - https://phabricator.wikimedia.org/T410667#11393129 (10hold_your_horses) [16:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:43:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet [16:43:11] (03CR) 10Kamila Součková: [C:03+1] hiera: hcaptcha/proxy: fix healthchecks for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1207896 
(https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:43:46] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [16:46:55] (03PS1) 10Ssingh: site.pp: reimage all new hcaptcha proxies to the right role [puppet] - 10https://gerrit.wikimedia.org/r/1207908 [16:48:03] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:48:33] (03CR) 10Ssingh: [C:03+2] hiera: hcaptcha/proxy: fix healthchecks for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1207896 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:49:06] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11393301 (10Clement_Goubert) @KOfori Could you approve this ? [16:49:09] (03CR) 10Dzahn: gerrit: add a local backup cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [16:49:17] (03CR) 10Dzahn: [C:03+1] gerrit: add dry run rsync [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [16:49:44] (03CR) 10Dzahn: "thanks, well. 
let me add Mateus first :)" [puppet] - 10https://gerrit.wikimedia.org/r/1207304 (owner: 10Dzahn) [16:49:56] (03CR) 10Dzahn: admin: deprecate the releasers-blubber group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [16:50:04] (03CR) 10Kamila Součková: [C:03+1] P:hcaptcha::proxy: do not restart nginx (do reload) [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:50:37] (03CR) 10Kamila Součková: [C:03+1] site.pp: reimage all new hcaptcha proxies to the right role [puppet] - 10https://gerrit.wikimedia.org/r/1207908 (owner: 10Ssingh) [16:52:37] (03PS4) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) [16:53:11] PROBLEM - Host sretest1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1006.eqiad.wmnet [16:54:53] (03PS1) 10Majavah: P:wmcs::cloudgw: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1207912 [16:54:53] (03PS1) 10Majavah: P:wmcs::cloudgw: Cleanup natlog feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1207913 [16:54:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:54:59] !log draining eqiad d3 wikikube hosts for network migration [16:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:54] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1034.eqiad.wmnet [16:56:12] !log robh@cumin2002 START - Cookbook 
sre.k8s.pool-depool-node depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet [16:56:29] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1034.eqiad.wmnet [16:56:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393364 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1034.eqiad.wmnet completed: - wikikube-worke... [16:56:39] RECOVERY - Host sretest1005 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:58:00] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet [16:58:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393373 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet completed: - wi... [16:58:15] (03PS2) 10Dzahn: admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313 [16:58:24] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11393374 (10ayounsi) Thanks, yeah that must be the reason : ` >>> spicerack.redfish('sretest1005').hw_model 9 >>> spicerack.redfish('sretest2004').hw_model 9 >... [16:59:07] (03CR) 10CI reject: [V:04-1] admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [17:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1700). [17:00:06] No Gerrit patches in the queue for this window AFAICS. 
[17:00:17] (03PS3) 10Dzahn: admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313 [17:01:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet [17:02:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:03:38] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1159.eqiad.wmnet with reason: C/D Migration [17:04:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:05:53] (03CR) 10Ssingh: [V:03+1 C:03+2] P:hcaptcha::proxy: do not restart nginx (do reload) [puppet] - 10https://gerrit.wikimedia.org/r/1207905 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [17:06:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:10:02] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1162.eqiad.wmnet with reason: C/D Migration [17:10:03] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1163.eqiad.wmnet with reason: C/D Migration [17:10:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1034.eqiad.wmnet with reason: C/D Migration [17:11:34] (03CR) 10Ssingh: [C:03+2] site.pp: reimage all new hcaptcha proxies to the right role [puppet] - 10https://gerrit.wikimedia.org/r/1207908 (owner: 10Ssingh) [17:11:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:12:16] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1034.eqiad.wmnet [17:12:19] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1034.eqiad.wmnet [17:12:23] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet [17:12:28] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet [17:12:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393442 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1034.eqiad.wmnet completed: - wikikube-worker1... 
[17:12:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393444 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet completed: - wiki... [17:12:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [17:13:48] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy1002.wikimedia.org with OS bookworm [17:14:02] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... [17:14:16] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy2001.wikimedia.org with OS bookworm [17:14:34] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... [17:14:36] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy2002.wikimedia.org with OS bookworm [17:14:50] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... 
[17:15:22] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet [17:15:28] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet [17:16:05] !log eqiad wikikube d3 repooled, depooling d8 wikikube hosts [17:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:57] robh: i see some wikikube operations, is it a good idea to do a mediawiki deployment now? or should i wait? [17:17:14] my understanding is it shouldn't matter but maybe wait until the depool commands complete [17:17:22] shouldn't be more than 5 minutes [17:17:32] can ping you when it's done, i see no issue doing a deploy once they're fully depooled [17:17:42] it's 1 rack out of 8 racks of wikikube heh [17:17:46] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1207915 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [17:17:58] okay, will do [17:18:03] ty! [17:18:14] the waiting for depool is likely me being paranoid but that's part of the job description, thanks for checking! 
[17:18:29] given it should be 5 mins, i'll start the CI though [17:18:35] definitely [17:18:40] but i'll wait for ping before actually touching prod [17:18:53] (03PS1) 10Urbanecm: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207919 (https://phabricator.wikimedia.org/T410666) [17:19:01] (03PS1) 10Urbanecm: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207920 (https://phabricator.wikimedia.org/T410666) [17:19:08] (03CR) 10Urbanecm: [C:03+2] hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207920 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm) [17:19:12] (03CR) 10Urbanecm: [C:03+2] hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207919 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm) [17:20:08] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet [17:20:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393479 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet... 
[17:20:19] that's 1 of 2 depools done [17:20:42] i split my command into 2 cuz i didn't feel like making a regex 100 characters long ; D [17:20:57] sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker10[04,19,20,37,67,68,69,70,71,96,97].eqiad.wmnet is long enough =P [17:20:57] T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 [17:21:42] wikikube-worker1037 is being slow to evict all its pods [17:22:16] no worries, still waiting :) [17:22:23] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet [17:22:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393495 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eq... [17:22:30] urbanecm: ^ go for it =] [17:22:35] depools complete [17:22:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:22:51] ty! have an ETA 6 on CI :) [17:22:56] do you want me to ping you once i'm done? 
[17:23:34] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1167.eqiad.wmnet with reason: C/D Migration [17:24:34] (03PS5) 10Jsn.sherman: Enable rr-ml AutoModerator CC form on !large wikis Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [17:24:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1168.eqiad.wmnet with reason: C/D Migration [17:25:02] (03PS1) 10Ssingh: hiera: common.yaml: add hcaptcha to all sites [puppet] - 10https://gerrit.wikimedia.org/r/1207922 (https://phabricator.wikimedia.org/T409780) [17:25:33] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1107.eqiad.wmnet with reason: C/D Migration [17:26:12] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1107.eqiad.wmnet with reason: C/D Migration [17:26:27] (03CR) 10Jsn.sherman: "Alrighty, I've removed large wikis with `while read -r db; do composer manage-dblist del $db revertrisk-multilingual; done < dblists/large" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [17:26:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1108.eqiad.wmnet with reason: C/D Migration [17:27:45] (03CR) 10Urbanecm: "question: if the goal is to enable this on all non-large wikis, why is this not a dbexpr instead? 
that way, you can formulate something li" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [17:27:49] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy1002.wikimedia.org with reason: host reimage [17:28:31] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1109.eqiad.wmnet with reason: C/D Migration [17:29:23] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1110.eqiad.wmnet with reason: C/D Migration [17:30:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1004.eqiad.wmnet with reason: C/D Migration [17:31:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1019.eqiad.wmnet with reason: C/D Migration [17:32:03] (03Merged) 10jenkins-bot: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207920 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm) [17:32:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:32:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1096.eqiad.wmnet with reason: C/D Migration [17:32:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:33:01] (03Merged) 10jenkins-bot: hotfix: Disable Urdu alias for Special:Homepage [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207919 (https://phabricator.wikimedia.org/T410666) (owner: 10Urbanecm) [17:33:27] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy2002.wikimedia.org with reason: host reimage [17:33:29] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy1002.wikimedia.org with reason: host reimage [17:33:30] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy2001.wikimedia.org with reason: host reimage [17:34:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1097.eqiad.wmnet with reason: C/D Migration [17:36:38] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1020.eqiad.wmnet with reason: C/D Migration [17:36:42] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1207920|hotfix: Disable Urdu alias for Special:Homepage (T410666)]], [[gerrit:1207919|hotfix: Disable Urdu alias for Special:Homepage (T410666)]] [17:36:46] T410666: Visiting Special:Homepage is not recognised on urwiki - https://phabricator.wikimedia.org/T410666 [17:36:51] (03CR) 10Scott French: [C:03+1] "Thanks for highlighting that enabling the downstream connection limit is new." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1203194 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [17:37:15] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy2002.wikimedia.org with reason: host reimage [17:37:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:37:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:38:39] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1037.eqiad.wmnet with reason: C/D Migration [17:39:42] (03CR) 10CDanis: Hive: alert when query rate is too high (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [17:39:59] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1164.eqiad.wmnet with reason: C/D Migration [17:41:03] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy2001.wikimedia.org with reason: host reimage [17:42:13] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1165.eqiad.wmnet with reason: C/D Migration [17:42:23] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1067.eqiad.wmnet with reason: C/D Migration [17:42:45] FIRING: [5x] WidespreadPuppetFailure: 
Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:44:01] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1068.eqiad.wmnet with reason: C/D Migration [17:44:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1069.eqiad.wmnet with reason: C/D Migration [17:45:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1070.eqiad.wmnet with reason: C/D Migration [17:45:47] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1071.eqiad.wmnet with reason: C/D Migration [17:45:50] 06SRE, 10Phabricator: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent - https://phabricator.wikimedia.org/T410572#11393654 (10Volans) Adding #collaboration-services [17:46:11] (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) [17:46:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:48:54] (03CR) 10Scott French: [C:03+1] "Pleasantly surprised to see the stock admin config already adds `ignore_global_conn_limit`, so you don't have to deal with that here." 
[puppet] - 10https://gerrit.wikimedia.org/r/1203195 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [17:49:01] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy1002.wikimedia.org with OS bookworm [17:49:13] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393663 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte... [17:50:09] (03CR) 10Scott French: [C:03+1] "Thanks for updating this!" [puppet] - 10https://gerrit.wikimedia.org/r/1207844 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [17:50:50] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet [17:50:51] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet [17:50:57] !log wikikube migrations in eqiad complete, repooling d8 [17:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:59] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet [17:51:01] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet [17:51:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393676 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet co... 
[17:51:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393677 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqia... [17:51:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:53:13] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aux-k8s-worker1007.eqiad.wmnet with reason: C/D Migration [17:53:40] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy2002.wikimedia.org with OS bookworm [17:53:56] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393679 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte... 
[17:54:03] jouncebot: nowandnext [17:54:04] For the next 0 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1700) [17:54:04] In 0 hour(s) and 5 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800) [17:54:04] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800) [17:54:48] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aux-k8s-worker1006.eqiad.wmnet with reason: C/D Migration [17:55:00] swfrench-wmf: note i am (still) running scap :-/ got hit by the full image build :-/ [17:56:24] urbanecm: ah, thanks for the heads-up! my changes should be fairly quick (i.e., don't require the full window), so just keep me posted :) [17:56:36] swfrench-wmf: i'll ping you once done! [17:58:15] (03PS4) 10Bking: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse) [17:58:30] !log robh@cumin2002 START - Cookbook sre.mysql.parsercache [17:58:31] !log robh@cumin2002 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [17:59:34] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy2001.wikimedia.org with OS bookworm [17:59:52] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte... 
[17:59:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800). [18:00:05] swfrench-wmf: May I have your attention please! MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1800) [18:01:06] FYI, holding for now. I'll be deploying scap and then deploying with said scap :) [18:02:22] I don't have anything to deploy this week swfrench-wmf [18:04:34] bd808: ack, thanks! [18:04:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1207920|hotfix: Disable Urdu alias for Special:Homepage (T410666)]], [[gerrit:1207919|hotfix: Disable Urdu alias for Special:Homepage (T410666)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:04:54] T410666: Visiting Special:Homepage is not recognised on urwiki - https://phabricator.wikimedia.org/T410666 [18:05:26] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:05:33] patch "fixes" the problem, proceeding [18:05:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11393717 (10RobH) Update: @Ladsgroup had other things going on and wasn't able to do this today but did link me to the directions on how to depool: https:... 
[18:05:55] (03CR) 10Ssingh: [C:03+2] hiera: common.yaml: add hcaptcha to all sites [puppet] - 10https://gerrit.wikimedia.org/r/1207922 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [18:08:54] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3001.wikimedia.org with OS bookworm [18:09:09] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... [18:09:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393719 (10RobH) Please note all wikikube workers have been migrated and we're now down to only 4 hosts left with #serviceops to migrate: wikikube-ctrl1003 kafka-main1008 kafk... [18:09:26] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS bookworm [18:09:30] sukhe@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [18:09:41] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... [18:09:47] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4001.wikimedia.org with OS bookworm [18:09:59] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... 
[18:10:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11393725 (10Marostegui) @robh I'm out and not near a keyboard but you have to replace pc1016 with pc6 [18:11:17] (03PS1) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) [18:11:24] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4002.wikimedia.org with OS bookworm [18:11:37] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393740 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... [18:11:46] (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:11:52] (03PS2) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) [18:12:23] (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:12:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester) [18:12:42] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7662/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 
(https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:12:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:12:53] (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) [18:13:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [18:13:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11393756 (10RobH) Day 8 Update: * 22 hosts moved today, 22 remain ** all wikikube and aux host migrations completed ** (3) pc hosts in discussion with data-p... [18:13:55] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5001.wikimedia.org with OS bookworm [18:14:03] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy5002.wikimedia.org with OS bookworm [18:14:08] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... 
[18:14:17] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests, 13Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393761 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st... [18:15:15] (03CR) 10Scott French: [C:03+1] fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [18:16:10] (03PS3) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) [18:16:39] (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:16:58] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7663/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:17:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11393787 (10RobH) 05Open→03Resolved All #infrastructure-foundations hosts in eqiad c/d rows migrated to the new switch stacks. 
[18:17:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:18:24] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:39] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207920|hotfix: Disable Urdu alias for Special:Homepage (T410666)]], [[gerrit:1207919|hotfix: Disable Urdu alias for Special:Homepage (T410666)]] (duration: 41m 57s) [18:18:43] T410666: Visiting Special:Homepage is not recognised on urwiki - https://phabricator.wikimedia.org/T410666 [18:18:48] yeah the drmrs one I take responsibility for. reimage happening soon. non-prod host, completely fine to ignore [18:18:51] swfrench-wmf: (finally) done! [18:19:10] urbanecm: thank you! 
[18:21:47] !log sudo cumin 'A:lvs-eqiad or A:lvs-codfw' 'disable-puppet "set druid-coordinator to state lvs_setup"' [18:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:14] !log swfrench@deploy2002 Installing scap version "4.228.0" for 2 host(s) [18:22:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:22:53] (03PS4) 10Ebernhardson: relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) [18:23:14] (03CR) 10Ssingh: [C:03+2] LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [18:23:22] (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:23:42] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7664/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:24:01] !log swfrench@deploy2002 Installation of scap version "4.228.0" completed for 2 hosts [18:24:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:25:46] (03PS5) 10Ebernhardson: relforge: Change to test role [puppet] - 
10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) [18:26:15] (03CR) 10CI reject: [V:04-1] relforge: Change to test role [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:26:32] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7665/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [18:26:42] !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run after switching scap mwscript to PHP 8.3 - T405955 [18:26:46] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:27:00] !log sukhe@lvs1020:~$ sudo systemctl restart pybal.service: T406222 [18:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:04] T406222: Add druid coordinator service to LVS for the druid_public cluster. 
- https://phabricator.wikimedia.org/T406222 [18:27:14] !log swfrench@deploy2002 Stopping before sync operations [18:27:44] (03PS2) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) [18:28:46] !log swfrench@deploy2002 Started scap sync-world: Normal scap run after switching scap mwscript to PHP 8.3 - T405955 [18:29:07] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:29:13] ok [18:29:18] that is me, looking [18:29:27] this change https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/195335854fd83138604462f54264e0de4b3c8daf [18:29:35] puppet is disabled everywhere else, so no worries [18:29:40] (on LVS'es) [18:29:57] * swfrench-wmf thumbs up [18:29:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:33:35] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: host reimage [18:33:50] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3002.wikimedia.org with reason: host reimage [18:33:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: host reimage [18:33:59] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage [18:34:20] !log swfrench@deploy2002 Finished scap sync-world: Normal scap run after switching scap mwscript to PHP 8.3 - T405955 (duration: 05m 34s) [18:34:25] T405955: MediaWiki on 
PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:34:34] weird, the service is definitely known to IPVS [18:34:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:35:03] (03PS3) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) [18:35:58] TCP 10.2.2.15:8081 mh (mh-port) [18:36:52] (03PS5) 10Bking: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse) [18:37:04] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:37:18] ^ yeah that's fine, will be reimaged soon, non prod [18:37:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:38:51] (03PS2) 10Scott French: De-configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) [18:39:41] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Restore beta cluster php_fpm_restart_script setting [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) [18:39:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: host 
reimage [18:39:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:40:00] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [18:42:04] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:42:35] ah I see what's happening with the LVS change [18:42:36] ok reverting [18:42:40] (03CR) 10Bearloga: [C:03+2] EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [18:43:25] (03Merged) 10jenkins-bot: EventStreamConfig: drop revision and namespace id from contributors.experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207882 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [18:44:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3002.wikimedia.org with reason: host reimage [18:46:06] (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) [18:46:34] (03PS1) 10Ssingh: Revert "LVS: set druid-coordinator to state lvs_setup" [puppet] - 10https://gerrit.wikimedia.org/r/1207933 [18:46:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3001.wikimedia.org with reason: host reimage 
[18:49:44] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: host reimage
[18:51:08] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:51:52] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:52:42] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:52:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:52:52] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:55:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:58:24] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:59:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4001.wikimedia.org with OS bookworm
[18:59:58] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393865 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:00:04] brennen and andre: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1900).
[19:00:36] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5001.wikimedia.org with reason: host reimage
[19:00:49] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy5002.wikimedia.org with reason: host reimage
[19:01:10] (PS6) Bking: apt: update opensearch3 key [puppet] - https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: DCausse)
[19:01:58] brennen: FYI T409743 seems to have gotten sorted out already
[19:01:58] T409743: English Wikibooks main page subpages under cascading protection are editable by anyone, and MP stylesheets do not display protection messages to non-admins - https://phabricator.wikimedia.org/T409743
[19:02:09] o/
[19:02:13] andre: thx
[19:02:27] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3002.wikimedia.org with OS bookworm
[19:02:41] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:03:13] (CR) Jsn.sherman: "@murbanec@wikimedia.org that sounds like a good solution, but I'm loath to use set logic in this repo unless I am really clear on how it w" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle)
[19:03:24] RESOLVED: JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:03:40] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5001.wikimedia.org with reason: host reimage
[19:04:03] (CR) RLazarus: [C:+1] "From my read of https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/master/modules/ext.wikimediaEvents/phpEngine.js I li" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[19:04:15] (CR) Jsn.sherman: "> the plan is to add this config to all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle)
[19:04:44] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3001.wikimedia.org with OS bookworm
[19:04:52] !log 1.46.0-wmf.3 train status (T408273): no current blockers, rolling to all wikis
[19:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:57] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273
[19:05:03] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:05:25] (PS1) TrainBranchBot: group2 to 1.46.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1207945 (https://phabricator.wikimedia.org/T408273)
[19:05:27] (CR) TrainBranchBot: [C:+2] "Initiated by brennen@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207945 (https://phabricator.wikimedia.org/T408273) (owner: TrainBranchBot)
[19:06:21] (Merged) jenkins-bot: group2 to 1.46.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1207945 (https://phabricator.wikimedia.org/T408273) (owner: TrainBranchBot)
[19:07:27] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy5002.wikimedia.org with reason: host reimage
[19:08:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4002.wikimedia.org with OS bookworm
[19:08:47] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393880 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:09:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:11:30] (PS2) Aaron Schulz: Mark non-wikimedia.org math APIs as deprecated in the sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773)
[19:16:28] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy6001.wikimedia.org with OS bookworm
[19:16:31] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy6002.wikimedia.org with OS bookworm
[19:16:35] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.3 refs T408273
[19:16:36] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7001.wikimedia.org with OS bookworm
[19:16:40] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273
[19:16:41] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS bookworm
[19:16:43] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393895 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:16:44] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393896 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:16:50] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:16:57] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393899 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was st...
[19:19:15] (PS1) Brennen Bearnes: Do not pass callback arguments to incompatible method [extensions/GlobalPreferences] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207950 (https://phabricator.wikimedia.org/T410551)
[19:20:34] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393903 (Scott_French) @RobH - Confirming conf1009 for 2025-11-24, but the SRE staff meeting runs from 17:00 - 18:00 UTC. I'd suggest starting no later than 16:30 (and pausin...
[19:21:47] (PS6) Ebernhardson: relforge: Change to test role [puppet] - https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681)
[19:22:29] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393911 (RobH) >>! In T405950#11393903, @Scott_French wrote: > @RobH - Confirming conf1009 for 2025-11-24, but the SRE staff meeting runs from 17:00 - 18:00 UTC. I'd suggest...
[19:22:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:23:24] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:24:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5001.wikimedia.org with OS bookworm
[19:25:08] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393914 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:27:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:28:24] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:29:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy5002.wikimedia.org with OS bookworm
[19:29:18] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11393921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[19:32:22] (CR) Ssingh: [C:+2] Revert "LVS: set druid-coordinator to state lvs_setup" [puppet] - https://gerrit.wikimedia.org/r/1207933 (owner: Ssingh)
[19:32:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:33:24] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:35:26] !log sukhe@lvs1020:~$ sudo systemctl restart pybal.service
[19:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:48] ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11393935 (Scott_French) 18:15 UTC sounds good to me. Thank you!
[19:37:00] !log sudo cumin 'A:lvs-eqiad or A:lvs-codfw' 'run-puppet-agent --enable "set druid-coordinator to state lvs_setup"'
[19:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:40:54] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:43:01] brennen: once the train is clear and logs are clean, would it be alright if I sneak in a backport that didn't fit into the infra window earlier this morning?
[19:43:43] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy6001.wikimedia.org with reason: host reimage
[19:43:45] swfrench-wmf: go ahead - things are looking ok now.
[19:43:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage
[19:44:01] brennen: great, thank you!
[19:44:04] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage
[19:44:20] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy6002.wikimedia.org with reason: host reimage
[19:45:12] (CR) Ebernhardson: [C:+1] apt: update opensearch3 key [puppet] - https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: DCausse)
[19:46:06] (CR) Scott French: "Thanks for the review!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[19:46:14] (CR) Bking: [C:+2] apt: update opensearch3 key [puppet] - https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: DCausse)
[19:46:27] (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[19:47:16] (Merged) jenkins-bot: De-configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1204948 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[19:47:36] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1204948|De-configure cookie-based enrollment in PHP 8.3 (T405955)]]
[19:47:41] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[19:47:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:48:47] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy6001.wikimedia.org with reason: host reimage
[19:52:12] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1204948|De-configure cookie-based enrollment in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:52:44] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage
[19:53:35] !log swfrench@deploy2002 swfrench: Continuing with sync
[19:53:57] (CR) Dzahn: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: Ebernhardson)
[19:56:11] jouncebot: nowandnext
[19:56:11] For the next 1 hour(s) and 3 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1900)
[19:56:11] In 1 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2100)
[19:56:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage
[19:56:35] (PS1) Reedy: AccountRecovery: Log more data for account recovery submissions [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207963
[19:57:39] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204948|De-configure cookie-based enrollment in PHP 8.3 (T405955)]] (duration: 10m 03s)
[19:57:44] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[19:59:48] (CR) Reedy: [C:+2] AccountRecovery: Log more data for account recovery submissions [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207963 (owner: Reedy)
[20:00:53] swfrench-wmf: Do you need to deploy anything else?
[20:01:00] Reedy: all done!
[20:01:05] (Merged) jenkins-bot: AccountRecovery: Log more data for account recovery submissions [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207963 (owner: Reedy)
[20:01:05] cool, cheers
[20:01:59] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1207963|AccountRecovery: Log more data for account recovery submissions]]
[20:03:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T410589)', diff saved to https://phabricator.wikimedia.org/P85422 and previous config saved to /var/cache/conftool/dbconfig/20251120-200304-ladsgroup.json
[20:03:09] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[20:03:16] (CR) Dzahn: "using this gerrit patch also for open discussion what to do with it" [puppet] - https://gerrit.wikimedia.org/r/1207296 (owner: Dzahn)
[20:04:14] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy6002.wikimedia.org with reason: host reimage
[20:06:17] !log reedy@deploy2002 reedy: Backport for [[gerrit:1207963|AccountRecovery: Log more data for account recovery submissions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:07:19] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy6001.wikimedia.org with OS bookworm
[20:07:36] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394023 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:07:40] !log reedy@deploy2002 reedy: Continuing with sync
[20:08:09] (PS1) Dzahn: admin: remove fisch-wmde from releasers-wikidiff2 [puppet] - https://gerrit.wikimedia.org/r/1207967
[20:08:39] (PS2) Dzahn: admin: remove wmde-fisch from releasers-wikidiff2 [puppet] - https://gerrit.wikimedia.org/r/1207967
[20:11:41] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207963|AccountRecovery: Log more data for account recovery submissions]] (duration: 09m 42s)
[20:12:33] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7002.wikimedia.org with OS bookworm
[20:12:47] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:14:23] (CR) Aklapper: [C:+1] "Looking at the list of four repos on https://phabricator.wikimedia.org/diffusion/query/6WUBLfM9eS2R/ , is it an intentional decision that " [puppet] - https://gerrit.wikimedia.org/r/1207296 (owner: Dzahn)
[20:14:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:32] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7001.wikimedia.org with OS bookworm
[20:15:46] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394044 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:16:23] (CR) Dzahn: [C:+2] "don't have too much background here but merging per "beta only"" [puppet] - https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: Ahmon Dancy)
[20:17:15] (CR) Ahmon Dancy: "Thanks dzahn!" [puppet] - https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166) (owner: Ahmon Dancy)
[20:18:08] (CR) Dzahn: "arrg, I am sorry, I am duplicating my own existing patch from the past" [puppet] - https://gerrit.wikimedia.org/r/1207296 (owner: Dzahn)
[20:18:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P85423 and previous config saved to /var/cache/conftool/dbconfig/20251120-201812-ladsgroup.json
[20:18:24] RESOLVED: [4x] JobUnavailable: Reduced availability for job mtail in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:22:26] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11394056 (Ladsgroup) Very likely a popular gadget/css hardcoding the url. I investigate once I get my hands on a PC
[20:24:18] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy6002.wikimedia.org with OS bookworm
[20:24:35] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394057 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage starte...
[20:27:01] (CR) Dzahn: "please still feel asked to review but move the discussion over to existing comments from Krinkle -> https://gerrit.wikimedia.org/r/c/opera" [puppet] - https://gerrit.wikimedia.org/r/1207296 (owner: Dzahn)
[20:27:08] (Abandoned) Dzahn: switch historic Subversion URLs from Phabricator to static-codereview [puppet] - https://gerrit.wikimedia.org/r/1207296 (owner: Dzahn)
[20:27:34] SRE, Infrastructure-Foundations, Traffic, vm-requests, Patch-For-Review: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11394058 (ssingh) Open→Resolved a:ssingh Once https://ger...
[20:33:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P85424 and previous config saved to /var/cache/conftool/dbconfig/20251120-203320-ladsgroup.json
[20:36:34] (PS1) Ssingh: hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780)
[20:37:24] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7666/co" [puppet] - https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: Ssingh)
[20:37:42] (CR) Ssingh: [V:+1 C:-2] "Do not merge." [puppet] - https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: Ssingh)
[20:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:39:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:44:51] (PS1) Scott French: deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955)
[20:44:53] (PS1) Scott French: deployment_server: switch mw-script/main to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1207980 (https://phabricator.wikimedia.org/T405955)
[20:44:54] (PS1) Scott French: deployment_server: follow main release in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955)
[20:46:12] (PS2) Scott French: deployment_server: switch mw-script/main to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1207980 (https://phabricator.wikimedia.org/T405955)
[20:46:13] (PS2) Scott French: deployment_server: follow main release in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955)
[20:47:07] (CR) Scott French: "Thanks in advance for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[20:48:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T410589)', diff saved to https://phabricator.wikimedia.org/P85425 and previous config saved to /var/cache/conftool/dbconfig/20251120-204827-ladsgroup.json
[20:48:32] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[20:48:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[20:48:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T410589)', diff saved to https://phabricator.wikimedia.org/P85426 and previous config saved to /var/cache/conftool/dbconfig/20251120-204852-ladsgroup.json
[20:54:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[20:59:46] jouncebot: nowandnext
[20:59:46] For the next 0 hour(s) and 0 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T1900)
[20:59:46] In 0 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2100)
[21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2100).
[21:00:05] bvibber and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:09] (CR) Daniel Kinzler: rest-gateway: allow rate limits per time unit (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: Daniel Kinzler)
[21:00:11] o/
[21:00:20] Hey.
[21:00:23] i can spiderpig one or both in a pinch
[21:00:29] Sure, go for it.
[21:00:35] woo
[21:00:36] My one is nominally trivial.
[21:00:43] (He says…)
[21:00:45] mine is a one-character change :D
[21:00:54] Mine is a one-extension change. ;-)
[21:01:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:01:46] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[21:01:46] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: Jforrester)
[21:02:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:02:06] (CR) Dzahn: [C:+1] "thx" [puppet] - https://gerrit.wikimedia.org/r/1207864 (https://phabricator.wikimedia.org/T410628) (owner: Slyngshede)
[21:02:37] (Merged) jenkins-bot: Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[21:02:40] (Merged) jenkins-bot: Undeploy the WikimediaEditorTasks extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1207865 (https://phabricator.wikimedia.org/T376954) (owner: Jforrester)
[21:03:00] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1207273|Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps (T372165)]], [[gerrit:1207865|Undeploy the WikimediaEditorTasks extension (T376954)]]
[21:03:06] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:03:07] T376954: Stop using and then undeploy the WikimediaEditorTasks extension - https://phabricator.wikimedia.org/T376954
[21:04:03] hmm, i wonder if removing an extension will trigger a slow localization cache sync actually :D
[21:04:11] no worries, perfect time for it to be slow
[21:04:31] Yeah, sorry, didn't think of that.
[21:04:43] OTOH, your patch is beta-only, so you're not waiting for the sync anyway.
[21:04:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:06:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:08:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11394161 (10Jclark-ctr) a:05LSobanski→03RobH [21:09:09] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:21] As annoying as it is, I think it is nice that a slow deploy today takes like 20 minutes instead of the 60 minutes that a full scap with l10n rebuild once took. We are getting better in increments. :) [21:10:01] True. [21:10:08] there is still a lot to shake fists at though. [21:10:10] Also `21:04:09 Finished l10n-update (duration: 01m 06s)`. [21:10:26] Though the images deltas will be big, of course. [21:10:40] The load-i18n-from-JSON work would be nice [21:10:51] whee [21:11:26] we just need to put the localization cache in a redis cluster and put a front-end cache on it [21:11:35] I'm not sure what at this point would be the "better" replacement for the CDB files. 
[21:11:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:11:52] I'm not sure an LRU cache is ideal for a pan-lingual message cache. :-) [21:12:37] if we can afford to externalize it all that would be nice. having l10noid or something as the magic fast external, shared cache [21:12:44] separate the l10n cache from mediawiki deployments by introducing a new localisatoid microservice [21:12:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:13:02] call it the remotisation cache [21:13:16] there are some seriously hot paths for l10n lookup [21:13:22] Yes. [21:13:55] So hot that I'd not want it to go over the k8s service boundary. [21:14:02] Maybe a sidecar? Eh. [21:14:31] having an efficiently queryable on-device database makes a lot of sense for our use case where shit is extensively using these lookups serially during long page generations :D [21:14:40] but updating those databases efficiently seems hard [21:14:59] as I understand it, one of the slow parts today is bundling all of the messages in a given language into a big blob.
The json stuff we did to make rsync for that faster doesn't help with adding to a Docker container's layer size [21:15:27] the slow part is not really the bundling, but what it does to the layer delta in the image [21:15:58] and adding 1 en message without translation touches every other language [21:16:09] SRE, LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11394186 (Dzahn) a:OKryva-WMF Hi Ollie, let us know how it's going. Cheers. -- Daniel [21:16:13] (I think) [21:16:13] i feel like append-only updates to the files is what we really want? but that requires depending on the previous build output to make a new build [21:16:17] SRE, LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11394188 (Dzahn) p:Triage→Medium [21:17:06] In the docker containers our deltas are at a file level rather than a line level [21:17:21] oof [21:17:40] Yes, we take per-language-sparse-json i18n files, build them into per-language-complete-cdb files, and write them into the docker image alongside the actual code as a relatively low layer (I think?); switching to per-language-complete-json files without changing the docker build step won't help much, but might be a smidge faster to build in scap and load inside MW. The main value is the… yes. [21:17:57] Speaking of which, docker image build from scap is still ongoing, 14 minutes later. Sigh. [21:18:26] xkcd 303 [21:18:40] * bd808 was trying to keep y'all occupied so you wouldn't worry about the wall clock time ;) [21:18:45] Indeed. [21:19:01] I can do code review with an eye open for spiderpig and IRC. :-) [21:19:04] hehe [21:19:58] getting to single version containers will help some I think [21:20:20] Yes, if nothing else, half the bytes shipped. [21:20:24] bvibber: The incremental image build process that we use does in fact require access to the previous build, so that is not an unreasonable requirement.
[21:21:01] oooh spiffy [21:21:49] there is still a 5 minute wait penalty on large image uploads into the container registry to work around a bug in the swift storage backend [21:22:24] Oh, yeah. Will the smaller images from single-version-containers help with that? [21:22:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:23:25] I think so? We trigger the wait on the image size delta if I am remembering correctly [21:23:38] dancy has done neat and smart things in the incremental builds. [21:24:13] > Finished build-and-push-container-images (duration: 19m 48s) [21:24:35] Hurrah. Finally. [21:24:52] I do miss the sub-60-seconds config deploys I used to do back in the day. [21:25:11] Pre-linting, pre-canaries, pre-everything. But gosh was it fast. [21:25:16] hehe [21:25:18] yeah, file sync was nice for some stuff for sure [21:25:33] But also allowed us to take down production with arrray(). [21:25:40] So… let's not regress. [21:25:58] I'm sure bvibber can tell of the days when it was as easy as using vim on the nfs server :) [21:26:24] oh yeah i remember straight up debugging editing files on nfs raw [21:26:33] add some more printfs [21:26:35] :D [21:26:56] Fun times. [21:26:57] I was still doing that for wikitech up to like 2023 :) [21:27:32] I mean, there were some patches that could only be deployed by me making manual changes to mediawiki-staging and syncing them bit by bit. [21:27:36] * James_F shudders. [21:28:10] bvibber: I think this is labs only patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207273 [21:28:29] Amir1: Yes, it was. 
[21:28:34] don't we need something for production, or is it somewhere and I missed it :D [21:28:45] if it's intentional, then go ahead [21:28:57] Welcome back to another episode of Everything sucks today, but wait until you hear how bad it used to suck with your hosts bvibber, James_F, and bd808 [21:29:10] !log bvibber@deploy2002 bvibber, jforrester: Backport for [[gerrit:1207273|Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps (T372165)]], [[gerrit:1207865|Undeploy the WikimediaEditorTasks extension (T376954)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:29:16] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:29:16] T376954: Stop using and then undeploy the WikimediaEditorTasks extension - https://phabricator.wikimedia.org/T376954 [21:29:17] Amir1: aaaaaaagh you're right [21:29:19] bd808: I'd listen to that podcast. [21:29:24] i'll fix that.... when this is done [21:29:38] would you sell energy drinks too? I'd buy [21:29:50] Amir1: what about an NFT? [21:30:03] Some ponzi scheme crypto? [21:30:04] I would like a pain reliever as one of the sponsors [21:30:08] bvibber: Good to proceed at my end. [21:30:22] !log bvibber@deploy2002 bvibber, jforrester: Continuing with sync [21:31:11] NFS -> NFT, not that different [21:31:22] James_F: We actually could do a live version at the next hackathon... special guests Reedy and Amir1 to add more commentary. [21:31:25] Makes you regret your life choices?
[21:31:31] (PS1) Bvibber: Fix wgMediaViewerThumbnailBucketSizes on prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1208003 (https://phabricator.wikimedia.org/T372165) [21:31:58] (PS9) Cwhite: prometheus: split targets into directories by source [puppet] - https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) [21:32:25] shit maybe we should do a CDB replacement project at hackathon [21:32:38] I hear AI has a great condensed format to replace json [21:32:40] (CR) Cwhite: prometheus: split targets into directories by source (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: Cwhite) [21:33:17] bvibber: Sounds fun! [21:33:23] the amount of "why is this taking so long" that could address [21:33:43] That would be great [21:34:26] Ideally someone from RelEng would be there, rather than us just hacking on their day job and making a mess. [21:34:41] "how hard can an append-only database with efficient lookups be" [21:34:41] R.I.P. Brooke Vibber, 1978-2025 died from reading too many computer science papers [21:35:13] single version containers + l10n in php + opcode cache [21:35:30] actually i wonder about the relative performance of sqlite [21:35:43] probably too slow though [21:35:49] (PS1) Reedy: AccountRecovery: Allow temp users to access Special:AccountRecovery [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1208005 [21:35:59] Q: Isn't the l10n data stored in memcache or something like that? [21:36:26] sqlite should be testable, but I'd ask Tim first if he already tried it [21:37:05] dancy: you can configure for that, but CDB is faster [21:37:13] !log /srv/thanos-store cleanup on titan2001 (start) [21:37:15] I see.
[21:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:40] (CR) Bking: [C:+2] relforge: Change to test role [puppet] - https://gerrit.wikimedia.org/r/1207930 (https://phabricator.wikimedia.org/T410681) (owner: Ebernhardson) [21:37:43] no network latency and really fast indexing [21:37:57] Maybe related question: What does the purgeMessageBlobStore.php maintenance script do? [21:37:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:38:15] and does it have any relevant for the k8s deployments. [21:38:20] *relevance [21:38:28] (CR) Dzahn: "to clarify: the review request was not about checking every line of code. it was meant to be like "hey, is this the right place and idea t" [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: Arnaudb) [21:38:34] Clears some ResourceLoader related message blobs [21:38:37] scap sync-world runs it at the end of each deployment. [21:39:12] I vaguely think this is still needed. [21:39:14] $cache->touchCheckKey( self::makeGlobalPurgeKey( $cache ), $cache::HOLDOFF_TTL_NONE ); [21:39:16] MessageBlobStore is the cache of 10n strings for javascript, right? [21:39:26] *l10n [21:39:33] Yes. [21:39:37] ah [21:39:51] And it's persistent between MW images. [21:39:51] (CR) Reedy: [C:+2] AccountRecovery: Allow temp users to access Special:AccountRecovery [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1208005 (owner: Reedy) [21:40:10] bvibber: It turns out Reedy is deploying over your patch.
;-) [21:40:17] Ci-ing [21:40:19] hah [21:40:56] (03Merged) 10jenkins-bot: AccountRecovery: Allow temp users to access Special:AccountRecovery [extensions/EmailAuth] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208005 (owner: 10Reedy) [21:40:59] (03CR) 10Dzahn: "@aklapper@wikimedia.org That is a good question. I don't have the answer I think. If you don't mind could you repeat that maybe on the dup" [puppet] - 10https://gerrit.wikimedia.org/r/1207296 (owner: 10Dzahn) [21:41:17] (03PS1) 10Scott French: deployment_server: switch deployment hosts to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208006 (https://phabricator.wikimedia.org/T405955) [21:41:17] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208006 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [21:41:29] I like the idea of PHP-only l10n files, with multiple files per language, to allow for incremental changes to specific keys [21:42:55] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207273|Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps (T372165)]], [[gerrit:1207865|Undeploy the WikimediaEditorTasks extension (T376954)]] (duration: 39m 55s) [21:43:01] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:43:02] T376954: Stop using and then undeploy the WikimediaEditorTasks extension - https://phabricator.wikimedia.org/T376954 [21:43:31] dancy: ok you're gonna laugh but i think i can make php array source work as an rsync-friendly format by manipulating whitespace or comments [21:44:06] actually can i just append things and they'll overwrite? [21:44:29] Reedy: were you needing to deploy something? 
[21:44:53] if not i'll do 1208003 [21:44:56] Yeah, I was going to deploy something to fix an issue with Special:AccountRecovery for temp accounts [21:45:02] bvibber: Any change to a file will result in the new copy of the file being in the image in full, shadowing the old file. [21:45:06] You can stick it out at the same time as something else though too [21:45:20] spiffy. you might throwing my 1208003 in with your wash? :) [21:45:23] *mind [21:46:06] bvibber: I encourage hacking and I'm happy to test ideas in my train-dev environment. [21:46:07] dancy: ah in that case i'll need a file per version. that has performance implications but is manageable [21:46:13] (CR) Reedy: [C:+2] Fix wgMediaViewerThumbnailBucketSizes on prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1208003 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber) [21:46:14] cool! [21:46:24] Docker layer diffs being file scoped rather than line scoped is a thing that I keep forgetting and remembering again [21:46:56] bvibber: Agreed. It's a matter of tradeoffs but there's probably a reasonable threshold where it can still pay off. [21:47:02] (Merged) jenkins-bot: Fix wgMediaViewerThumbnailBucketSizes on prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1208003 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber) [21:47:04] yeah definitely [21:47:39] Maybe we should make a Phab task for this idea as a Hackathon project, so we can CC historically-involved people like K.rinkle and _.joe_ in case they're interested?
[21:48:00] https://phabricator.wikimedia.org/project/view/8319/ [21:48:40] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1208003|Fix wgMediaViewerThumbnailBucketSizes on prod (T372165)]], [[gerrit:1208005|AccountRecovery: Allow temp users to access Special:AccountRecovery]] [21:48:44] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:50:28] (PS1) Alexandros Kosiaris: base: Add starship in trixie and beyond [puppet] - https://gerrit.wikimedia.org/r/1208012 [21:50:28] (PS1) Alexandros Kosiaris: base: Switch away from legacy fact, lint ignore $::realm [puppet] - https://gerrit.wikimedia.org/r/1208013 [21:51:02] James_F: let's do it! you wanna open a task or shall i? [21:51:24] bvibber: You go for it. I'll CC in. [21:51:28] sweet [21:51:34] bvibber: Do you need/want to test yours when it's ready? Or don't really care? [21:51:43] You have more social capital, after all. :-) [21:51:46] Reedy: naw it's functionally identical to the previous behavior [21:51:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:52:02] just ...
no longer setting a setting that doesn't work in most cases ;) [21:52:40] James_F: the best thing about being in the bug-oisie for so long is having all this social capital [21:53:22] T99740 is a thing to read too when thinking about CDB -> PHP shifts [21:53:22] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [21:54:06] !log reedy@deploy2002 bvibber, reedy: Backport for [[gerrit:1208003|Fix wgMediaViewerThumbnailBucketSizes on prod (T372165)]], [[gerrit:1208005|AccountRecovery: Allow temp users to access Special:AccountRecovery]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:54:12] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:54:28] woo [21:54:37] I note that irc ping is happening before a short hang to let you continue on the console [21:54:39] !log reedy@deploy2002 bvibber, reedy: Continuing with sync [21:54:56] Reedy: That was a requested feature. :-) [21:55:10] "GET READY TO TEST" [21:55:12] * taavi fears it was them who requested it [21:55:43] https://imgflip.com/i/acr2d0 [21:55:52] (03PS1) 10Jforrester: tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) [21:55:57] haha [21:56:24] If we did everything on slack, we could have so many more images inline in the process [21:56:44] https://phabricator.wikimedia.org/T378740 [21:56:46] and emojis [21:56:51] Reedy: no [21:56:52] Discuss there. :-) [21:56:54] Reedy: this is the real reason for spiderpig [21:59:55] hi! I would like to do a security deploy. is something currently going on? 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251120T2200) [22:00:24] James_F: https://phabricator.wikimedia.org/T410694 for starters [22:00:46] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208003|Fix wgMediaViewerThumbnailBucketSizes on prod (T372165)]], [[gerrit:1208005|AccountRecovery: Allow temp users to access Special:AccountRecovery]] (duration: 12m 06s) [22:00:51] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [22:01:08] maryum: ^ check with Reedy [22:01:49] My (and bvibber's deploy) is done [22:02:02] whee thanks Reedy [22:02:24] Need to see if the web team show up to use the window [22:03:24] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:04:04] 99.9% of the time they do not (web team) [22:04:26] FIRING: InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1084 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [22:04:46] o/ [22:04:55] o/ [22:04:58] !incidents [22:04:58] 7041 (UNACKED) InboundMXQueueHigh sre (mx-in1001:9154 eqiad) [22:04:58] 7036 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [22:05:03] !ack 7041 [22:05:03] 7041 (ACKED) InboundMXQueueHigh sre (mx-in1001:9154 eqiad) [22:05:50] cwhite: I seem to recall this happening recently [22:06:20] SRE folks: should we hold off on the couple of security deploys while you look into the above issue? [22:06:50] bvibber: Tsk, localisation cache is the canonical spelling in MW-land! 
;-) [22:07:03] sbassett: I think you should be good to go, cwhite - any concerns? [22:07:26] I agree - probably ok to continue with deploys [22:07:44] * swfrench-wmf thumbs up [22:07:51] Yes, if they've not shown up by now, go for it. [22:08:23] James_F I'm about to start my deploy in a min [22:08:34] maryum: +1 [22:08:57] lol [22:09:26] FIRING: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1084 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [22:09:36] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [22:09:49] !incidents [22:09:50] 7041 (ACKED) InboundMXQueueHigh sre (mx-in1001:9154 eqiad) [22:09:50] 7036 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [22:12:54] (03CR) 10Andrea Denisse: "Hi! Could you please add tests for the alerts?" [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [22:12:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:14:26] RESOLVED: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1084 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [22:15:13] (03CR) 10Jforrester: "The actually dropping of these tables, T410692, should happen first!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester) [22:15:24] end conflict in phab [22:15:29] *edit conflict in phab, just like old times [22:16:28] bvibber: yeah, sorry. I think I got your changes back? [22:16:36] yep thx :D [22:16:39] (03CR) 10Andrea Denisse: [C:03+2] Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt) [22:16:50] phab's complete lack of conflict detection is annoying [22:17:10] scap currently running [22:18:14] (03CR) 10Cwhite: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight) [22:19:27] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:22:57] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:24:10] (03PS1) 10Bking: opensearch: Fix typo in repo GPG key filename. [puppet] - 10https://gerrit.wikimedia.org/r/1208020 (https://phabricator.wikimedia.org/T407123) [22:24:46] (03PS1) 10SBassett: ActionApi: Remove the xslt option [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208021 (https://phabricator.wikimedia.org/T401987) [22:24:48] !log mstyles Deployed security patch for T407157 [22:25:05] scap finished! 
[22:25:42] sbassett all yours now [22:27:06] tx [22:29:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:31:35] (03PS1) 10MusikAnimal: ChangesListHooks: show entity titles in recent changes and watchlists [extensions/CommunityRequests] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208022 (https://phabricator.wikimedia.org/T406957) [22:35:44] (03CR) 10Ryan Kemper: [C:03+1] opensearch: Fix typo in repo GPG key filename. [puppet] - 10https://gerrit.wikimedia.org/r/1208020 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking) [22:35:56] (03CR) 10Bking: [C:03+2] opensearch: Fix typo in repo GPG key filename. [puppet] - 10https://gerrit.wikimedia.org/r/1208020 (https://phabricator.wikimedia.org/T407123) (owner: 10Bking) [22:36:00] PROBLEM - Host cirrussearch2061 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:28] RECOVERY - Host cirrussearch2061 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [22:38:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208021 (https://phabricator.wikimedia.org/T401987) (owner: 10SBassett) [22:39:00] (03PS1) 10BryanDavis: toolforge: Add redis-tools to bastions [puppet] - 10https://gerrit.wikimedia.org/r/1208023 (https://phabricator.wikimedia.org/T410102) [22:39:45] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge[1008-1010].eqiad.wmnet with reason: T410681 [22:39:50] T410681: Setup opensearch 3 on relforge servers - https://phabricator.wikimedia.org/T410681 [22:42:03] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host 
relforge1008.eqiad.wmnet with OS trixie [22:42:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:36] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2061 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:43:09] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS trixie [22:43:09] (03Merged) 10jenkins-bot: ActionApi: Remove the xslt option [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208021 (https://phabricator.wikimedia.org/T401987) (owner: 10SBassett) [22:43:28] !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1208021|ActionApi: Remove the xslt option (T401987 T401995)]] [22:43:33] T401987: Consider deprecating/removing the xslt option from the action api - https://phabricator.wikimedia.org/T401987 [22:43:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS trixie [22:45:02] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [22:45:07] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:46:11] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1010.eqiad.wmnet with OS trixie [22:46:15] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1009.eqiad.wmnet with OS trixie [22:46:23] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) 
for host relforge1008.eqiad.wmnet with OS trixie [22:47:31] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS bookworm [22:48:03] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm [22:48:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm [22:49:39] !log bking@apt1002 reprepro --component thirdparty/opensearch3 update bookworm-wikimedia [22:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:21] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/output/1208023/7668/" [puppet] - 10https://gerrit.wikimedia.org/r/1208023 (https://phabricator.wikimedia.org/T410102) (owner: 10BryanDavis) [22:50:37] !log bking@apt1002 reprepro --component thirdparty/opensearch2 update bookworm-wikimedia [22:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:37] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2061 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:54:16] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:54:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:55:24] bking@cumin2002 reimage (PID 1383021) is awaiting input [22:55:58] bking@cumin2002 reimage (PID 1383273) is 
awaiting input [22:57:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS bookworm [22:57:25] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS bookworm [22:58:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm [22:59:01] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [23:04:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [23:05:32] bking@cumin2002 reimage (PID 1388071) is awaiting input [23:07:02] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS bookworm [23:07:24] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm [23:09:43] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm [23:10:14] !log restarted postfix on mx-in1001, mx-in2001 at ~ 23:00 UTC for config change [23:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:56] (03CR) 10WMDE-Fisch: [V:03+1] admin: remove wmde-fisch from releasers-wikidiff2 [puppet] - 10https://gerrit.wikimedia.org/r/1207967 (owner: 10Dzahn) [23:14:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - 
https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:17:04] bking@cumin2002 reimage (PID 1394483) is awaiting input [23:19:17] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:19:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1010.eqiad.wmnet with OS bookworm [23:19:37] !log sbassett@deploy2002 sbassett: Backport for [[gerrit:1208021|ActionApi: Remove the xslt option (T401987 T401995)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:19:43] T401987: Consider deprecating/removing the xslt option from the action api - https://phabricator.wikimedia.org/T401987 [23:20:00] !log sbassett@deploy2002 sbassett: Continuing with sync [23:24:17] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:24:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:25:13] !log /srv/thanos-store cleanup on titan2001 (end) [23:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:41] 
(03CR) 10Bking: [C:03+1] "Thanks for reaching out on this one!" [puppet] - 10https://gerrit.wikimedia.org/r/1203548 (owner: 10CDanis) [23:32:46] !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208021|ActionApi: Remove the xslt option (T401987 T401995)]] (duration: 49m 18s) [23:32:51] T401987: Consider deprecating/removing the xslt option from the action api - https://phabricator.wikimedia.org/T401987 [23:34:28] jouncebot: nowandnext [23:34:28] No deployments scheduled for the next 7 hour(s) and 25 minute(s) [23:34:28] In 7 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251121T0700) [23:35:22] sbassett: If you are done I would like to push out a couple of wikitech config changes. No worries if you are still working on stuff. [23:35:56] (03PS1) 10Scott French: mw-*: clean up 8.3 migration rollingUpdate and timeout tweaks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955) [23:38:32] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS bookworm [23:39:09] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm [23:39:39] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1008.eqiad.wmnet with OS bookworm [23:40:05] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm [23:41:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11394566 (10RobH) Clarification Questions and statements: * Most hosts previously had both 1G and 10G just these new config E are 10G only so they'll have to be in 10G capable ra... 
[23:41:32] (03PS1) 10Scott French: deployment_server: switch mw-debug/pinkunicorn to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955) [23:45:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208022 (https://phabricator.wikimedia.org/T406957) (owner: 10MusikAnimal) [23:46:14] bd808 pretty sure sbassett is done [23:46:26] thx maryum [23:46:30] bking@cumin2002 reimage (PID 1411105) is awaiting input [23:47:25] bking@cumin2002 reimage (PID 1411603) is awaiting input [23:47:29] heh. musikanimal jumped the queue before I got there [23:47:44] sorry! lol [23:48:42] yours at least doesn't look like it will cause a third full l10n rebuild :) [23:48:46] there are no localization changes so I think this one shouldn't take very long [23:48:50] yeah lol! [23:52:53] (03Merged) 10jenkins-bot: ChangesListHooks: show entity titles in recent changes and watchlists [extensions/CommunityRequests] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1208022 (https://phabricator.wikimedia.org/T406957) (owner: 10MusikAnimal) [23:53:14] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1208022|ChangesListHooks: show entity titles in recent changes and watchlists (T406957)]] [23:53:18] T406957: Show wish titles on lists (like on Wikidata) - https://phabricator.wikimedia.org/T406957 [23:57:39] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1208022|ChangesListHooks: show entity titles in recent changes and watchlists (T406957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:58:13] !log musikanimal@deploy2002 musikanimal: Continuing with sync [23:59:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage