[00:23:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[00:28:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[00:34:33] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[00:40:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1229803
[00:40:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1229803 (owner: TrainBranchBot)
[00:53:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1229803 (owner: TrainBranchBot)
[00:57:37] (PS1) Papaul: Update partman recipe for mwlog1003 [puppet] - https://gerrit.wikimedia.org/r/1229807 (https://phabricator.wikimedia.org/T412230)
[00:57:45] (CR) Zabe: [C:+2] Start reading from il_target_id on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1229607 (https://phabricator.wikimedia.org/T413669) (owner: Zabe)
[00:58:38] (Merged) jenkins-bot: Start reading from il_target_id on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1229607
(https://phabricator.wikimedia.org/T413669) (owner: Zabe)
[00:59:50] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1229607|Start reading from il_target_id on small wikis (T413669)]]
[00:59:57] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669
[01:00:48] (CR) Papaul: [C:+2] Update partman recipe for mwlog1003 [puppet] - https://gerrit.wikimedia.org/r/1229807 (https://phabricator.wikimedia.org/T412230) (owner: Papaul)
[01:02:28] !log zabe@deploy2002 zabe: Backport for [[gerrit:1229607|Start reading from il_target_id on small wikis (T413669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:02:51] !log zabe@deploy2002 zabe: Continuing with sync
[01:07:02] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229607|Start reading from il_target_id on small wikis (T413669)]] (duration: 07m 12s)
[01:07:07] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669
[01:10:52] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1229811
[01:10:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1229811 (owner: TrainBranchBot)
[01:14:33] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:23:02] !log pt1979@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog1003.eqiad.wmnet with reason: host reimage
[01:27:25] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog1003.eqiad.wmnet with reason: host reimage
[01:33:49] (Merged) jenkins-bot: Branch commit for
wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1229811 (owner: TrainBranchBot)
[01:46:56] !log pt1979@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1003"
[01:47:15] !log pt1979@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1003"
[01:47:16] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog1003.eqiad.wmnet with OS bookworm
[01:47:26] ops-eqiad, SRE, DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11543847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm completed: - mwlog...
[01:55:45] ops-eqiad, SRE, DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11543851 (Papaul) @herron Hello the default partman recipe for mwlog is not working with new servers so to install mwlog1003 I created a new line for it in the p...
[01:56:21] ops-eqiad, SRE, DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11543852 (Papaul)
[02:20:56] (PS1) Andrew Bogott: keystone: update init.d modules for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229843
[02:21:41] (CR) Andrew Bogott: [C:+2] keystone: update init.d modules for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229843 (owner: Andrew Bogott)
[02:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:19:15] (PS1) Andrew Bogott: keystone: change init.d module back to uwsgi [puppet] - https://gerrit.wikimedia.org/r/1229883
[03:19:15] (PS1) Andrew Bogott: keystone: on flamingo, admin traffic goes to the public backend [puppet] - https://gerrit.wikimedia.org/r/1229884
[03:20:09] (CR) Andrew Bogott: [C:+2] keystone: change init.d module back to uwsgi [puppet] - https://gerrit.wikimedia.org/r/1229883 (owner: Andrew Bogott)
[03:20:11] (CR) Andrew Bogott: [C:+2] keystone: on flamingo, admin traffic goes to the public backend [puppet] - https://gerrit.wikimedia.org/r/1229884 (owner: Andrew Bogott)
[03:24:37] (PS1) Andrew Bogott: keystone: remove admin service module for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229890
[03:25:19] (CR) Andrew Bogott: [C:+2] keystone: remove admin service module for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229890 (owner: Andrew Bogott)
[03:29:23] (PS1) Andrew Bogott: keystone: remove admin service module for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229893
[03:29:34] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1229893 (owner: Andrew Bogott)
[03:32:51] (PS2) Andrew Bogott: keystone: remove admin service module for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229893
[03:32:56] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1229893 (owner: Andrew Bogott)
[03:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:36:29] (CR) Andrew Bogott: [C:+2] keystone: remove admin service module for flamingo [puppet] - https://gerrit.wikimedia.org/r/1229893 (owner: Andrew Bogott)
[03:42:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[03:43:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87847 and previous config saved to /var/cache/conftool/dbconfig/20260122-034302-marostegui.json
[03:43:10] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[03:43:11] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[03:49:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87848 and previous config saved to /var/cache/conftool/dbconfig/20260122-034905-marostegui.json
[03:49:14] T411163: Drop ar_sha1 from archive table in wmf production -
https://phabricator.wikimedia.org/T411163
[03:49:14] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[03:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:59:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P87849 and previous config saved to /var/cache/conftool/dbconfig/20260122-035914-marostegui.json
[04:09:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P87850 and previous config saved to /var/cache/conftool/dbconfig/20260122-040922-marostegui.json
[04:19:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87851 and previous config saved to /var/cache/conftool/dbconfig/20260122-041931-marostegui.json
[04:19:39] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[04:19:39] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[04:19:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[04:19:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87852 and previous config saved to /var/cache/conftool/dbconfig/20260122-041956-marostegui.json
[04:48:48] (PS1) Samwilson: Enable watchlist labels on testwiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967)
[05:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:27:29] (PS1) Jasmine: aux-k8s: add sophroid namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1230108
[05:34:13] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:50:58] SRE, MW-on-K8s, serviceops, ServiceOps new, ServiceOps-SharedInfra: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11544011 (jasmine_)
[05:51:15] SRE, MW-on-K8s, ServiceOps new, ServiceOps-SharedInfra: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11544012 (jasmine_)
[06:05:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1239.eqiad.wmnet with reason: long schema change
[06:06:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1240.eqiad.wmnet with reason: long schema change
[06:21:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87853 and previous config saved to /var/cache/conftool/dbconfig/20260122-062114-marostegui.json
[06:21:22] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:21:23] T411164: Drop rev_sha1 from revision table in wmf
production - https://phabricator.wikimedia.org/T411164
[06:31:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P87854 and previous config saved to /var/cache/conftool/dbconfig/20260122-063122-marostegui.json
[06:31:57] (PS1) Marostegui: dbproxy2006.yaml: Testing Debian Trixie [puppet] - https://gerrit.wikimedia.org/r/1230137 (https://phabricator.wikimedia.org/T414656)
[06:32:29] (CR) Marostegui: [C:+2] dbproxy2006.yaml: Testing Debian Trixie [puppet] - https://gerrit.wikimedia.org/r/1230137 (https://phabricator.wikimedia.org/T414656) (owner: Marostegui)
[06:33:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy2006.codfw.wmnet with OS trixie
[06:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P87855 and previous config saved to /var/cache/conftool/dbconfig/20260122-064131-marostegui.json
[06:49:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2006.codfw.wmnet with reason: host reimage
[06:51:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87856 and previous config saved to /var/cache/conftool/dbconfig/20260122-065138-marostegui.json
[06:51:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[06:51:46] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:51:47]
T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:54:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2006.codfw.wmnet with reason: host reimage
[06:55:05] (PS1) Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1230144 (https://phabricator.wikimedia.org/T415238)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T0700)
[07:00:06] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T0700).
[07:15:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2006.codfw.wmnet with OS trixie
[07:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:33:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:36:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T410589)', diff saved to https://phabricator.wikimedia.org/P87857 and previous config saved to /var/cache/conftool/dbconfig/20260122-073604-ladsgroup.json
[07:36:10] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[07:38:11] RESOLVED: ProbeDown: Service phab1004:443 has failed
probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:46:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P87858 and previous config saved to /var/cache/conftool/dbconfig/20260122-074612-ladsgroup.json
[07:55:10] (PS1) Muehlenhoff: Update account settings for rkhan [puppet] - https://gerrit.wikimedia.org/r/1230221
[07:56:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P87859 and previous config saved to /var/cache/conftool/dbconfig/20260122-075620-ladsgroup.json
[07:56:37] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11544162 (JAllemandou) >>! In T414460#11542216, @ops-monitoring-bot wrote: > Roll-reboot of nodes in...
[07:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:59:34] (PS1) Jasmine: deploy: Add sophroid kubeconfig to aux-k8s [puppet] - https://gerrit.wikimedia.org/r/1230226
[08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:58] (CR) Muehlenhoff: [C:+2] Update account settings for rkhan [puppet] - https://gerrit.wikimedia.org/r/1230221 (owner: Muehlenhoff)
[08:06:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T410589)', diff saved to https://phabricator.wikimedia.org/P87860 and previous config saved to /var/cache/conftool/dbconfig/20260122-080629-ladsgroup.json
[08:06:34] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[08:06:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2225.codfw.wmnet with reason: Maintenance
[08:06:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T410589)', diff saved to https://phabricator.wikimedia.org/P87861 and previous config saved to /var/cache/conftool/dbconfig/20260122-080653-ladsgroup.json
[08:13:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Interlink (2a11:4141:6002::8) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:18:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Interlink (2a11:4141:6002::8) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:37:45] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11544190 (MatthewVernon) That is strange - logs are in `/var/log/spicerack/sre/hosts/reimage.log` on cumin2002 for both reimage...
[09:00:05] andre and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T0900)
[09:00:06] 0/
[09:06:26] (CR) Aklapper: [C:+2] Revert "Fix DivisionByZeroError when calculating bitrate" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1229724 (https://phabricator.wikimedia.org/T415169) (owner: Jforrester)
[09:06:49] (CR) Aklapper: [C:+2] "+2'ing to unblock the train" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1229724 (https://phabricator.wikimedia.org/T415169) (owner: Jforrester)
[09:07:34] (CR) Johannnes89: admin: Add johannnes89 to LDAP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: Federico Ceratto)
[09:07:38] (Merged) jenkins-bot: Revert "Fix DivisionByZeroError when calculating bitrate" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1229724 (https://phabricator.wikimedia.org/T415169) (owner: Jforrester)
[09:10:57] !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1229724|Revert "Fix DivisionByZeroError when calculating bitrate" (T415169)]]
[09:11:03] T415169: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169
[09:13:20] !log aklapper@deploy2002 jforrester, aklapper: Backport for [[gerrit:1229724|Revert "Fix DivisionByZeroError when calculating bitrate" (T415169)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:13:57] !log aklapper@deploy2002 jforrester, aklapper: Continuing with sync
[09:18:11] !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229724|Revert "Fix DivisionByZeroError when calculating bitrate" (T415169)]] (duration: 07m 13s)
[09:18:16] T415169: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169
[09:19:09] (PS1) Arnaudb: aptrepo: update jenkins key [puppet] - https://gerrit.wikimedia.org/r/1230247 (https://phabricator.wikimedia.org/T415214)
[09:19:13] (PS1) TrainBranchBot: group1 to 1.46.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1230250 (https://phabricator.wikimedia.org/T413803)
[09:19:16] (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230250 (https://phabricator.wikimedia.org/T413803) (owner: TrainBranchBot)
[09:20:16] (Merged) jenkins-bot: group1 to 1.46.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1230250 (https://phabricator.wikimedia.org/T413803) (owner: TrainBranchBot)
[09:21:04] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1230247 (https://phabricator.wikimedia.org/T415214) (owner: Arnaudb)
[09:22:51] (CR) Arnaudb: [C:+2] aptrepo: update jenkins key [puppet] - https://gerrit.wikimedia.org/r/1230247 (https://phabricator.wikimedia.org/T415214) (owner: Arnaudb)
[09:23:27] (CR) Dpogorzelski: [C:+1] docker_registry: simplify and improve the /v2/ comment [puppet] - https://gerrit.wikimedia.org/r/1229143 (owner: Elukey)
[09:24:35] (PS1) Dpogorzelski: ml-build: add ml-team-admins group [puppet] - https://gerrit.wikimedia.org/r/1230252
[09:25:33] (CR) Dpogorzelski: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1230252 (owner: Dpogorzelski)
[09:26:44] !log aklapper@deploy2002 rebuilt and synchronized wikiversions
files: group1 to 1.46.0-wmf.12 refs T413803
[09:26:49] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803
[09:29:24] (PS1) Marostegui: dbproxy2007: Debian Trixie [puppet] - https://gerrit.wikimedia.org/r/1230254 (https://phabricator.wikimedia.org/T414656)
[09:30:46] (CR) Marostegui: [C:+2] dbproxy2007: Debian Trixie [puppet] - https://gerrit.wikimedia.org/r/1230254 (https://phabricator.wikimedia.org/T414656) (owner: Marostegui)
[09:31:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy2007.codfw.wmnet with OS trixie
[09:32:00] (CR) Dpogorzelski: [C:+2] ml-build: add ml-team-admins group [puppet] - https://gerrit.wikimedia.org/r/1230252 (owner: Dpogorzelski)
[09:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:34:47] (CR) Filippo Giunchedi: [C:+1] dumps: rsync: Simplify configuration handling logic [puppet] - https://gerrit.wikimedia.org/r/1229600 (owner: Majavah)
[09:35:22] (CR) Elukey: "Adding Dawid as FYI."
[puppet] - https://gerrit.wikimedia.org/r/1229198 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite)
[09:37:34] (PS1) TrainBranchBot: group2 to 1.46.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1230255 (https://phabricator.wikimedia.org/T413803)
[09:37:37] (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230255 (https://phabricator.wikimedia.org/T413803) (owner: TrainBranchBot)
[09:38:25] (Merged) jenkins-bot: group2 to 1.46.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1230255 (https://phabricator.wikimedia.org/T413803) (owner: TrainBranchBot)
[09:44:16] (PS1) Elukey: profile::pyrra: add second SLO for Abstract Wikipedia [puppet] - https://gerrit.wikimedia.org/r/1230259 (https://phabricator.wikimedia.org/T415067)
[09:44:35] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.12 refs T413803
[09:44:42] T413803: 1.46.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T413803
[09:45:05] (CR) Aklapper: [C:-1] "-1 as patchset2 is unrelated to the commit summary. Please restore patchset1 - thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: Shivaansh Singh)
[09:45:29] (PS1) Bartosz Wójtowicz: ml-services: Update outlink article topic model image. [deployment-charts] - https://gerrit.wikimedia.org/r/1230260 (https://phabricator.wikimedia.org/T414573)
[09:46:34] (CR) Elukey: "Hey folks!
I think there is still value in seeing how this SLO is displayed in a Pyrra dashboard since we are not 100% ready yet with Slot" [puppet] - https://gerrit.wikimedia.org/r/1230259 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[09:46:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2007.codfw.wmnet with reason: host reimage
[09:48:17] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[09:48:42] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[09:50:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2007.codfw.wmnet with reason: host reimage
[09:51:52] (PS4) Muehlenhoff: nftables: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1219877
[09:53:18] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219877 (owner: Muehlenhoff)
[09:55:09] (PS2) Elukey: profile::pyrra: add second SLO for Abstract Wikipedia [puppet] - https://gerrit.wikimedia.org/r/1230259 (https://phabricator.wikimedia.org/T415067)
[10:14:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2007.codfw.wmnet with OS trixie
[10:15:24] SRE-SLO, Citoid, VisualEditor, Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11544415 (elukey) Hi @Mvolz! I think that [[ https://thanos.wikimedia.org/graph?g0.expr=sum(...
[10:17:15] SRE-SLO, Citoid, VisualEditor, Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc.
- https://phabricator.wikimedia.org/T345627#11544417 (elukey) Open→Resolved
[10:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:35:15] !log installing systemd bugfix updates from Bookworm point release
[10:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:18] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker1006.eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad)
[10:43:36] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11544445 (ops-monitoring-bot) Roll-reboot of nodes in dse-eqiad cluster started by btullis: * dse-k8...
[10:44:16] (CR) Majavah: [V:+1 C:+2] dumps: rsync: Simplify configuration handling logic [puppet] - https://gerrit.wikimedia.org/r/1229600 (owner: Majavah)
[10:45:04] SRE, sre-alert-triage, Infrastructure-Foundations, Maps: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#11544446 (LSobanski)
[10:46:13] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11544459 (BTullis) >>! In T414460#11544162, @JAllemandou wrote: >>>! In T414460#11542216, @ops-monit...
[10:47:30] (PS1) Majavah: dumps::rsync::fragment: Fix variable access [puppet] - https://gerrit.wikimedia.org/r/1230276
[10:47:57] (CR) Majavah: [C:+2] dumps::rsync::fragment: Fix variable access [puppet] - https://gerrit.wikimedia.org/r/1230276 (owner: Majavah)
[10:57:08] (CR) Samtar: [C:+1] Enable watchlist labels on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[10:57:57] SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11544493 (cmooney) Just want to add my two cents on the problem we hit trying to make the IPv6 IPs live. * I personally think it's cleaner if the dns servers only have configured, and only listen on, the IPs that ar...
[10:58:48] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7936/console" [puppet] - https://gerrit.wikimedia.org/r/1229633 (owner: Majavah)
[10:59:53] (PS1) Arnaudb: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.6 [puppet] - https://gerrit.wikimedia.org/r/1230278 (https://phabricator.wikimedia.org/T415214)
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1100)
[11:01:17] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1230278 (https://phabricator.wikimedia.org/T415214) (owner: Arnaudb)
[11:01:25] (PS2) Daniel Kinzler: redioscope: fix survey generation [deployment-charts] - https://gerrit.wikimedia.org/r/1229648
[11:01:40] (CR) Arnaudb: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.6 [puppet] - https://gerrit.wikimedia.org/r/1230278 (https://phabricator.wikimedia.org/T415214) (owner: Arnaudb)
[11:08:45] SRE, sre-alert-triage, Infrastructure-Foundations, Maps: Alert in need of triage:
OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#11544542 (10elukey) @MoritzMuehlenhoff is it possible that imposm got into the deadlock bug when we initialized it the la... [11:08:53] !log a-pizzata@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [11:08:58] !log a-pizzata@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [11:10:43] (03CR) 10Kamila Součková: [C:03+1] "LGTM for the next iteration :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229648 (owner: 10Daniel Kinzler) [11:11:18] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11544544 (10MoritzMuehlenhoff) p:05Triage→03High [11:11:27] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11544546 (10MoritzMuehlenhoff) 05Open→03Resolved All cleaned up [11:13:49] (03CR) 10Filippo Giunchedi: [C:03+1] hieradata: cloudceph: Set prefix length as an integer [puppet] - 10https://gerrit.wikimedia.org/r/1229633 (owner: 10Majavah) [11:14:24] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudceph: Set prefix length as an integer [puppet] - 10https://gerrit.wikimedia.org/r/1229633 (owner: 10Majavah) [11:16:41] (03PS4) 10Majavah: interface::ip: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1229624 [11:16:41] (03PS5) 10Majavah: interface::ip: Fix default prefix length for IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1229625 [11:16:41] (03PS2) 10Majavah: interface::ip: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1229634 [11:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - 
TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:19:55] (03CR) 10Daniel Kinzler: [C:03+2] redioscope: fix survey generation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229648 (owner: 10Daniel Kinzler) [11:20:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2077.codfw.wmnet with OS bullseye [11:20:48] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11544557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2077.codfw.wmnet with OS bullseye [11:21:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2077 [11:21:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{dse-k8s-worker1006.eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [11:21:27] (03PS1) 10Dpogorzelski: ml-builder-docker: add group [puppet] - 10https://gerrit.wikimedia.org/r/1230280 [11:21:42] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1229624 (owner: 10Majavah) [11:21:50] (03Merged) 10jenkins-bot: redioscope: fix survey generation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229648 (owner: 10Daniel Kinzler) [11:22:13] (03CR) 10CI reject: [V:04-1] ml-builder-docker: add group [puppet] - 10https://gerrit.wikimedia.org/r/1230280 (owner: 10Dpogorzelski) [11:22:17] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:22:28] (03PS2) 10Dpogorzelski: ml-builder-docker: add group [puppet] - 10https://gerrit.wikimedia.org/r/1230280 [11:23:44] (03CR) 10Elukey: "It looks good to me, let's see what Moritz thinks about it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1230280 (owner: 10Dpogorzelski) [11:25:17] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7937/console" [puppet] - 10https://gerrit.wikimedia.org/r/1229624 (owner: 10Majavah) [11:25:20] (03CR) 10Clément Goubert: [C:04-1] "Needs the edit to be properly applied." [puppet] - 10https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: 10Jasmine) [11:25:30] (03CR) 10Majavah: [V:03+1 C:03+2] interface::ip: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1229624 (owner: 10Majavah) [11:28:10] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2077 - mvernon@cumin2002" [11:28:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2077 - mvernon@cumin2002" [11:28:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:28:17] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2077.codfw.wmnet 238.32.192.10.in-addr.arpa 8.3.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:28:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2077.codfw.wmnet 238.32.192.10.in-addr.arpa 8.3.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:28:21] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2077 [11:28:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2077 [11:28:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2077 [11:29:28] 10SRE-tools, 
06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#11544573 (10Volans) @MLechvien-WMF I haven't work on this since the original unplanned effort that generated the above patc... [11:29:45] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11544575 (10Clement_Goubert) @Blake You can now implement the solution from https://phabricator.wikimedia.org/T415062#11538372 [11:31:08] !log daniel@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply [11:31:18] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230283 [11:31:40] PROBLEM - Druid coordinator on an-druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:32:27] !log daniel@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply [11:32:44] PROBLEM - Druid coordinator on an-druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:33:12] PROBLEM - Druid coordinator on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:34:40] RECOVERY - Druid coordinator on an-druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:34:44] RECOVERY - Druid coordinator on an-druid1004 is OK: PROCS OK: 1 process with command name java, args 
org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:34:52] 06SRE, 07sre-alert-triage, 06Infrastructure-Foundations, 10Maps: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#11544599 (10MoritzMuehlenhoff) >>! In T399158#11544541, @elukey wrote: > @MoritzMuehlenhoff is it possible that imposm go... [11:35:12] RECOVERY - Druid coordinator on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:35:58] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11544602 (10Blake) @Clement_Goubert That change should be in the most recent patch set of https://gerrit.wikimedia.org/r/c/ope... [11:39:41] (03CR) 10Clément Goubert: "Both are used but `runuser` is used a little more, I don't see an issue with either, your call." [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [11:40:24] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11544608 (10Clement_Goubert) >>! In T330996#11544602, @Blake wrote: > @Clement_Goubert That change should be in the most recen... 
[11:40:44] (03CR) 10Blake: "Done" [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [11:42:44] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab [11:46:25] jouncebot: nowandnext [11:46:25] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1100) [11:46:25] In 1 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1300) [11:46:39] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2077.codfw.wmnet with OS bullseye [11:46:50] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11544610 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2077.codfw.wmnet with OS bullseye execu... 
[11:47:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2077.codfw.wmnet with OS bullseye [11:47:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2077 [11:47:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2077 [11:47:20] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11544611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2077.codfw.wmnet with OS bullseye [11:48:13] hey folks, i'd like to lock scap for a few minutes to test a cookbook if there are no objections [11:51:32] (03PS5) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 [11:51:48] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab [11:52:16] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab [11:52:16] (03PS1) 10Clément Goubert: failoid-ng: Add wikikube users [puppet] - 10https://gerrit.wikimedia.org/r/1230287 [11:52:52] alright, proceeding with the test, so scap will be locked for a moment [11:54:18] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-lock-scap for datacenter switchover from codfw to eqiad [11:54:20] !log root@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter switchover from codfw to eqiad - T330996 [11:54:21] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-lock-scap (exit_code=0) for datacenter switchover from codfw to eqiad [11:54:25] T330996: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996 [11:54:40] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap 
lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11544633 (10ops-monitoring-bot) blake@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.00-lock-scap for datacenter switch... [11:54:57] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-unlock-scap for datacenter switchover from codfw to eqiad [11:54:59] !log root@deploy2002 Forcefully removing global lock: Datacenter switchover from codfw to eqiad - T12345 [11:54:59] !log root@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter switchover from codfw to eqiad - T330996 (duration: 00m 39s) [11:55:01] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-unlock-scap (exit_code=0) for datacenter switchover from codfw to eqiad [11:55:04] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [11:55:32] !log Run of script for T413868 has finished on s8 [11:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:36] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [11:56:47] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-unlock-scap for datacenter switchover from codfw to eqiad [11:56:49] !log root@deploy2002 Forcefully removing global lock: Datacenter switchover from codfw to eqiad - T330996 [11:56:50] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-unlock-scap (exit_code=0) for datacenter switchover from codfw to eqiad [11:57:10] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11544669 (10ops-monitoring-bot) blake@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.09-unlock-scap for datacenter swit... 
[11:57:30] (03CR) 10Clément Goubert: [C:03+1] sre.switchdc.mediawiki: Automate scap lock/unlock [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [11:57:31] (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [11:57:46] my testing is complete, scap is now unlocked. thanks! [11:57:57] (03CR) 10Clément Goubert: [C:03+1] failoid-ng: add namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229588 (owner: 10Kamila Součková) [11:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:00:15] (03CR) 10Blake: [C:03+2] sre.switchdc.mediawiki: Automate scap lock/unlock [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [12:00:16] !log Run of script for T413868 has finished on s4 [12:00:17] !log Run of script for T413868 has finished on s1 [12:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:25] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab [12:04:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff) [12:05:32] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2077.codfw.wmnet with reason: host reimage [12:06:01] !log kamila@deploy2002 Started deploy [restbase/deploy@dcc15be]: Add kaiwiki, kajwiki & 
pplwiki - T414238, T415039, T415047 [12:06:09] T414238: Add kaiwiki to RESTBase - https://phabricator.wikimedia.org/T414238 [12:06:09] T415039: Add kajwiki to RESTBase - https://phabricator.wikimedia.org/T415039 [12:06:09] T415047: Add pplwiki to RESTBase - https://phabricator.wikimedia.org/T415047 [12:06:37] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Automate scap lock/unlock [cookbooks] - 10https://gerrit.wikimedia.org/r/1229076 (https://phabricator.wikimedia.org/T330996) (owner: 10Blake) [12:07:41] FIRING: [18x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:09:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2077.codfw.wmnet with reason: host reimage [12:09:39] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 3 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11544713 (10Blake) 05Open→03Resolved [12:10:11] (03CR) 10Kamila Součková: [C:03+1] failoid-ng: Add wikikube users [puppet] - 10https://gerrit.wikimedia.org/r/1230287 (owner: 10Clément Goubert) [12:11:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Remove Puppet 5 CA cert from wmf-certificates cert bundle - https://phabricator.wikimedia.org/T415255 (10MoritzMuehlenhoff) 03NEW [12:12:41] FIRING: [111x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:12:41] (03CR) 10Clément Goubert: [C:03+2] failoid-ng: add namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229588 (owner: 10Kamila Součková) [12:12:54] (03CR) 10Clément Goubert: 
[C:03+2] failoid-ng: Add wikikube users [puppet] - 10https://gerrit.wikimedia.org/r/1230287 (owner: 10Clément Goubert) [12:20:25] (03Merged) 10jenkins-bot: failoid-ng: add namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229588 (owner: 10Kamila Součková) [12:22:15] !log kamila@deploy2002 Finished deploy [restbase/deploy@dcc15be]: Add kaiwiki, kajwiki & pplwiki - T414238, T415039, T415047 (duration: 16m 14s) [12:22:23] T414238: Add kaiwiki to RESTBase - https://phabricator.wikimedia.org/T414238 [12:22:23] T415039: Add kajwiki to RESTBase - https://phabricator.wikimedia.org/T415039 [12:22:23] T415047: Add pplwiki to RESTBase - https://phabricator.wikimedia.org/T415047 [12:22:41] FIRING: [111x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:23:57] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:25:59] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:26:30] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:26:38] (03CR) 10Silvan Heintze: "Hi @btullis@wikimedia.org and @brouberol@wikimedia.org, would you mind taking a look at this change? It has been reviewed by our team, but" [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze) [12:27:30] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:27:33] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230294 [12:27:41] RESOLVED: [111x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:27:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:07] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:29:50] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:30:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2077.codfw.wmnet with OS bullseye [12:30:14] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:30:25] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11544777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2077.codfw.wmnet... [12:31:08] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[12:32:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:36:36] (03PS1) 10Isabelle Hurbain-Palatin: Explicitly disable postprocessing cache for wikidata and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230308 (https://phabricator.wikimedia.org/T415111) [12:38:36] (03PS1) 10Daniel Kinzler: redioscope: remove suffix from redis urls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230309 [12:39:16] (03CR) 10Kamila Součková: [C:03+1] redioscope: remove suffix from redis urls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230309 (owner: 10Daniel Kinzler) [12:41:01] (03CR) 10Kamila Součková: [C:03+2] redioscope: remove suffix from redis urls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230309 (owner: 10Daniel Kinzler) [12:41:41] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230314 [12:41:49] (03CR) 10Brouberol: [C:03+1] Report # of skipped entities by type [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze) [12:42:01] (03CR) 10Brouberol: [C:03+2] "LHTM!" 
[dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze) [12:42:13] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230315 [12:42:21] (03Merged) 10jenkins-bot: Report # of skipped entities by type [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze) [12:42:59] (03Merged) 10jenkins-bot: redioscope: remove suffix from redis urls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230309 (owner: 10Daniel Kinzler) [12:47:28] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:47:28] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:48:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:48:28] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:48:28] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:49:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:53:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:54:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:58:33] (03PS1) 10Btullis: Configure druid clusters to reuse their /srv volume [puppet] - 10https://gerrit.wikimedia.org/r/1230321 (https://phabricator.wikimedia.org/T278056) [12:58:50] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11544877 (10MatthewVernon) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1300) [13:00:07] (03PS1) 10MVernon: Restore 3 reimaged hosts to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1230323 (https://phabricator.wikimedia.org/T354872) [13:01:40] (03CR) 10Brouberol: [C:03+1] "Ooh, that's why 2 of the 5 druid hosts kept their data. That makes sense!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1230321 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [13:06:13] (03CR) 10Btullis: [C:03+2] Configure druid clusters to reuse their /srv volume [puppet] - 10https://gerrit.wikimedia.org/r/1230321 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [13:06:47] (03CR) 10Jcrespo: [C:03+1] Restore 3 reimaged hosts to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1230323 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [13:07:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11544940 (10MatthewVernon) p:05High→03Medium @jhathaway I did ms-be2077 today, and see the same failure mode - it failed enti... [13:11:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11544951 (10MoritzMuehlenhoff) [13:13:31] !log installing e2fsprogs updates from Bookworm point release [13:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:59] !log kamila@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply [13:14:15] !log kamila@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply [13:18:22] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [13:18:32] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [13:23:47] (03CR) 10MVernon: [C:03+2] Restore 3 reimaged hosts to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1230323 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [13:27:40] 06SRE, 06cloud-services-team: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#11544977 (10jijiki) [13:32:10] (03PS1) 10Muehlenhoff: Remove 
puppetmaster role from puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) [13:32:12] (03PS1) 10Muehlenhoff: Remove puppetmaster2001 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230332 (https://phabricator.wikimedia.org/T365798) [13:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:38:05] 06SRE, 06cloud-services-team, 10Cloud-VPS: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#11545004 (10taavi) [13:38:26] (03CR) 10Joal: [C:03+1] "I don't know anything about that, but looks good functionally :)" [puppet] - 10https://gerrit.wikimedia.org/r/1230321 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [13:38:49] (03PS2) 10Muehlenhoff: Remove puppetmaster role from puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) [13:38:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:39:13] (03PS2) 10Muehlenhoff: Remove puppetmaster2001 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230332 (https://phabricator.wikimedia.org/T365798) [13:39:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11545008 (10MoritzMuehlenhoff) [13:40:03] (03PS1) 10Majavah: hieradata: openstack: Use dedicated memcache user in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1230334 (https://phabricator.wikimedia.org/T273950) [13:40:06] (03PS1) 10Majavah: hieradata: openstack: Use dedicated memcache user in eqiad1 [puppet] - 
10https://gerrit.wikimedia.org/r/1230335 (https://phabricator.wikimedia.org/T273950) [13:40:08] (03PS1) 10Majavah: hieradata: Use dedicated memcache user by default in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1230336 (https://phabricator.wikimedia.org/T273950) [13:41:54] !log installing libgd2 security updates [13:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:43] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7939/co" [puppet] - 10https://gerrit.wikimedia.org/r/1230334 (https://phabricator.wikimedia.org/T273950) (owner: 10Majavah) [13:44:21] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [13:46:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [13:48:18] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [13:48:36] (03CR) 10Ozge: [C:03+1] ml-services: Update outlink article topic model image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1230260 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [13:48:44] (03CR) 10Ladsgroup: [C:03+1] Explicitly disable postprocessing cache for wikidata and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230308 (https://phabricator.wikimedia.org/T415111) (owner: 10Isabelle Hurbain-Palatin) [13:50:18] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [13:51:42] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [13:52:31] (03PS1) 10Kamila Součková: failoid-ng: add a deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230326 [13:52:31] (03CR) 10Kamila Součková: "still needs the image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230326 (owner: 10Kamila Součková) [13:54:04] jouncebot: nowandnext [13:54:04] For the next 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1300) [13:54:04] In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1400) [13:54:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230308 (https://phabricator.wikimedia.org/T415111) (owner: 10Isabelle Hurbain-Palatin) [13:56:44] 06SRE, 06DC-Ops, 06serviceops: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11545085 (10MLechvien-WMF) @jasmine_ are you doing this task? 
Please ask others if you don't find the capacity [13:57:23] 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11545090 (10MLechvien-WMF) [13:58:24] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1400). [14:00:05] ihurbain: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:19] yaaaaaaay. [14:00:55] i can deploy on my own; i'm assuming the coast is clear but i'll give a few minutes for someone to raise their hand if it's not. [14:01:40] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update outlink article topic model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230260 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [14:02:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [14:02:16] !log installing krb5 security updates [14:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:03:14] !ack [14:03:15] 7358 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [14:03:26] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 85.44 ms [14:03:29] (03Merged) 10jenkins-bot: ml-services: Update outlink article topic model image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1230260 (https://phabricator.wikimedia.org/T414573) (owner: 10Bartosz Wójtowicz) [14:03:39] are there network issues? [14:03:46] checking [14:05:15] volans: I am in a meeting, I'll wait a little to join if you don't mind, in case it is something big please ping me and I'll join straight away [14:05:31] elukey: ack, no prob, checking dashboards right now [14:06:50] (03CR) 10Ssingh: "Thanks for the patch, I think this is a good idea. Since I am assuming there are lots of hosts using interface::ip (655 per cumin), should" [puppet] - 10https://gerrit.wikimedia.org/r/1229625 (owner: 10Majavah) [14:07:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:08:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, memcached is the default user in the Debian packaging anyway" [puppet] - 10https://gerrit.wikimedia.org/r/1230334 (https://phabricator.wikimedia.org/T273950) (owner: 10Majavah) [14:08:54] okay, starting spiderpig [14:09:08] there was a spike of tcp.timed_out NEL events mostly from UK already recovering [14:09:09] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: openstack: Use dedicated memcache user in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1230334 (https://phabricator.wikimedia.org/T273950) (owner: 10Majavah) [14:09:39] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: openstack: Use dedicated memcache user in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1230334 (https://phabricator.wikimedia.org/T273950) (owner: 10Majavah) [14:09:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230308 (https://phabricator.wikimedia.org/T415111) (owner: 10Isabelle Hurbain-Palatin) [14:10:30] (03CR) 
10Muehlenhoff: hieradata: Use dedicated memcache user by default in Cloud VPS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1230336 (https://phabricator.wikimedia.org/T273950) (owner: 10Majavah) [14:10:50] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7940/" [puppet] - 10https://gerrit.wikimedia.org/r/1229625 (owner: 10Majavah) [14:10:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1230332 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:11:06] (03Merged) 10jenkins-bot: Explicitly disable postprocessing cache for wikidata and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230308 (https://phabricator.wikimedia.org/T415111) (owner: 10Isabelle Hurbain-Palatin) [14:11:23] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1230308|Explicitly disable postprocessing cache for wikidata and commons (T415111)]] [14:11:29] T415111: Explicitly deactivate the post-processing cache on Commons and Wikidata - https://phabricator.wikimedia.org/T415111 [14:11:30] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:34] (03CR) 10Majavah: [V:03+1] "The primary reason this is so widely used is that it's used in `interface::add_ip6_mapped` which in turn is used on most of the fleet. The" [puppet] - 10https://gerrit.wikimedia.org/r/1229625 (owner: 10Majavah) [14:13:25] !log ihurbain@deploy2002 ihurbain: Backport for [[gerrit:1230308|Explicitly disable postprocessing cache for wikidata and commons (T415111)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:13:38] elukey: self-resolved, check notes for detail, no need to drop off the meeting [14:13:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:14:30] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:17:25] !log ihurbain@deploy2002 ihurbain: Continuing with sync [14:17:36] volans: <3 [14:18:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:18:51] (03PS1) 10Majavah: utils/pcc: Add bash completion script [puppet] - 10https://gerrit.wikimedia.org/r/1230341 [14:21:35] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230308|Explicitly disable postprocessing cache for wikidata and commons (T415111)]] (duration: 10m 12s) [14:21:40] T415111: Explicitly deactivate the post-processing cache on Commons and Wikidata - https://phabricator.wikimedia.org/T415111 [14:23:33] okay, i'm finished; there's no other patch listed for deployment window, so it's free [14:25:30] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:26:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:27:22] (03CR) 10Ssingh: [C:03+1] "Yeah looks good. Let's do it. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1229625 (owner: 10Majavah) [14:27:41] (03CR) 10Majavah: [V:03+1 C:03+2] interface::ip: Fix default prefix length for IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1229625 (owner: 10Majavah) [14:28:27] (03PS3) 10Majavah: interface::ip: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1229634 [14:28:30] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:28:32] (03PS4) 10Federico Ceratto: admin: Add johannnes89 to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) [14:29:45] (03CR) 10Federico Ceratto: "Updated email and comment" [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: 10Federico Ceratto) [14:31:08] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11545202 (10Gehel) [14:31:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:31:20] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11545206 (10Gehel) p:05Triage→03High [14:33:30] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11545220 (10Gehel) p:05Triage→03High [14:33:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: 10Federico Ceratto) [14:34:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] 
- 10https://gerrit.wikimedia.org/r/1229634 (owner: 10Majavah) [14:34:29] (03CR) 10Majavah: [C:03+2] interface::ip: Use consistent indentation [puppet] - 10https://gerrit.wikimedia.org/r/1229634 (owner: 10Majavah) [14:34:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:36:28] (03PS1) 10Ssingh: Revert^2 "dnsbox: advertise ns[0-2] IPv6" [puppet] - 10https://gerrit.wikimedia.org/r/1230351 [14:36:40] (03PS1) 10Majavah: icinga: Remove unused check_exim_queue script [puppet] - 10https://gerrit.wikimedia.org/r/1230352 [14:36:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11545247 (10Gehel) [14:37:15] (03PS2) 10Majavah: icinga: Remove unused check_exim_queue script [puppet] - 10https://gerrit.wikimedia.org/r/1230352 [14:37:51] (03PS2) 10Ssingh: dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1230351 (https://phabricator.wikimedia.org/T81605) [14:39:30] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7941/co" [puppet] - 10https://gerrit.wikimedia.org/r/1230351 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [14:41:59] (03PS2) 10Clément Goubert: failoid-ng: add a deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230326 (owner: 10Kamila Součková) [14:42:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11545300 (10Gehel) [14:46:35] (03CR) 10Federico Ceratto: [C:03+2] admin: Add johannnes89 to LDAP [puppet] - 
10https://gerrit.wikimedia.org/r/1229200 (https://phabricator.wikimedia.org/T414789) (owner: 10Federico Ceratto) [14:48:41] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests, 13Patch-For-Review: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11545331 (10FCeratto-WMF) 05Open→03Resolved a:03FCeratto-WMF Change deployed, closing task. Please reopen it if there's any issue. [14:49:46] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11545351 (10ssingh) https://puppet-compiler.wmflabs.org/output/1230351/7941/ ` Interface::Ip[ns1-v6] Exec[ip addr add 2620:0:861:53::1/128 label lo:anycast dev lo] Augeas[lo_2620:0:8... [14:50:26] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 4258.99 ms [14:51:58] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:53:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: 10Kgraessle) [14:55:49] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [14:57:40] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:59:13] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Follow up on lists.wm.o TLS usage - https://phabricator.wikimedia.org/T398018#11545411 (10ABran-WMF) Just an update to see if there is any planned work for this task. It could maybe be discarded because of {T286066} and {T378028} (or maybe a subseq... 
[14:59:16] !log mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 5 thumbsize (T406724) [14:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:21] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [15:02:09] (03CR) 10Filippo Giunchedi: [C:03+1] icinga: Remove unused check_exim_queue script [puppet] - 10https://gerrit.wikimedia.org/r/1230352 (owner: 10Majavah) [15:02:22] (03CR) 10Majavah: [C:03+2] icinga: Remove unused check_exim_queue script [puppet] - 10https://gerrit.wikimedia.org/r/1230352 (owner: 10Majavah) [15:03:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11545444 (10BTullis) Hi @Jclark-ctr - I've unmounted the problematic drive. You can replace it whenever is convenient for you. Thanks. [15:07:28] 06SRE, 07sre-alert-triage, 06Infrastructure-Foundations, 10Maps: Alert in need of triage: OsmSynchronisationLag (instance maps-test2001:9100) - https://phabricator.wikimedia.org/T399158#11545482 (10elukey) 05Open→03Resolved a:03elukey @MoritzMuehlenhoff you are totally right, I got sidetracked by... [15:08:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11545492 (10herron) Thanks so much for sorting through this @Papaul and @Jclark-ctr! Yes looks good to me, ready to revert to the reuse variant. Thanks again! 
[15:09:02] 06SRE, 06DBA, 06Infrastructure-Foundations, 10netops, 10observability: librenms.syslog table size - https://phabricator.wikimedia.org/T349362#11545494 (10Marostegui) [15:12:17] (03CR) 10Clément Goubert: [C:03+2] failoid-ng: add a deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230326 (owner: 10Kamila Součková) [15:14:11] (03Merged) 10jenkins-bot: failoid-ng: add a deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230326 (owner: 10Kamila Součková) [15:16:29] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:19:52] (03CR) 10Elukey: [C:03+1] Remove puppetmaster role from puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:20:33] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [15:20:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:27:46] (03PS1) 10Clément Goubert: failoid-ng: Fix port and gunicorn arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230368 [15:28:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [15:29:08] (03PS2) 10Clément Goubert: failoid-ng: Fix port and gunicorn arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230368 [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1530) [15:32:19] (03PS1) 10Papaul: Revert to use the old partman for mwlog nodes [puppet] - 10https://gerrit.wikimedia.org/r/1230369 (https://phabricator.wikimedia.org/T412230) 
[15:32:37] (03CR) 10Kamila Součková: [C:03+1] failoid-ng: Fix port and gunicorn arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230368 (owner: 10Clément Goubert) [15:33:50] (03CR) 10Clément Goubert: [C:03+2] failoid-ng: Fix port and gunicorn arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230368 (owner: 10Clément Goubert) [15:35:05] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:35:33] (03Merged) 10jenkins-bot: failoid-ng: Fix port and gunicorn arguments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230368 (owner: 10Clément Goubert) [15:35:47] (03CR) 10Papaul: [C:03+2] Revert to use the old partman for mwlog nodes [puppet] - 10https://gerrit.wikimedia.org/r/1230369 (https://phabricator.wikimedia.org/T412230) (owner: 10Papaul) [15:35:53] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:35:56] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:36:00] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:36:15] (03CR) 10Elukey: [C:03+2] role::puppetserver: add the analytics-sre user key and configs [puppet] - 10https://gerrit.wikimedia.org/r/1229590 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:36:46] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:37:13] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [15:37:27] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:38:43] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11545647 (10Papaul) 05Open→03Resolved @herron complete closing the task. Thank you. 
[15:42:22] (03PS1) 10Clément Goubert: failoid_ng: fix app port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230373 [15:44:38] (03CR) 10Clément Goubert: [C:03+2] failoid_ng: fix app port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230373 (owner: 10Clément Goubert) [15:46:23] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:46:28] (03Merged) 10jenkins-bot: failoid_ng: fix app port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230373 (owner: 10Clément Goubert) [15:46:29] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:46:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [15:46:57] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:47:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [15:52:07] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:52:12] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [15:53:12] (03CR) 10BCornwall: utils/pcc: Add bash completion script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1230341 (owner: 10Majavah) [15:53:57] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [15:56:18] (03PS1) 10Clément Goubert: failoid_ng: Multiple releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230375 [15:56:25] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [15:57:41] (03CR) 10Clément Goubert: [C:03+2] failoid_ng: Multiple releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230375 (owner: 10Clément Goubert) [15:58:23] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet2006-dev is CRITICAL: CRITICAL: no netns defined? 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:59:26] (03Merged) 10jenkins-bot: failoid_ng: Multiple releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230375 (owner: 10Clément Goubert) [15:59:57] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [15:59:59] (03CR) 10Brouberol: [C:03+2] eventgate-analytics: increase instances to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227392 (https://phabricator.wikimedia.org/T411454) (owner: 10Milimetric) [16:00:04] andre and jeena: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1600). 
[16:00:09] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [16:01:40] (03CR) 10Btullis: [C:03+1] eventgate-analytics: increase instances to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227392 (https://phabricator.wikimedia.org/T411454) (owner: 10Milimetric) [16:02:09] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: MediaSearch should stop relying on render map config - https://phabricator.wikimedia.org/T415282 (10MatthewVernon) 03NEW [16:02:41] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11545791 (10MatthewVernon) p:05Triage→03High [16:02:42] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: MediaSearch should stop relying on render map config - https://phabricator.wikimedia.org/T415282#11545792 (10MatthewVernon) p:05Triage→03High [16:04:33] 10SRE-swift-storage, 06Data-Persistence, 10MediaSearch, 10Thumbor, 06Traffic: MediaSearch should stop relying on render map config - https://phabricator.wikimedia.org/T415282#11545794 (10Ladsgroup) To emphasize: MediaSearch does respond with standard sizes but the js uses a config that should be used. [16:07:20] jouncebot: nowandnext [16:07:21] For the next 0 hour(s) and 52 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1600) [16:07:21] In 0 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1700) [16:07:46] andre and jeena: Are you using this window? 
If not, I want to backport something :D [16:08:02] (03PS1) 10Ladsgroup: TimedMediaThumbnail: Set physical width and height [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230378 (https://phabricator.wikimedia.org/T402792) [16:08:12] Amir1: It's a Logstash triage window, no deployment window [16:08:42] so whatever you may deploy, I wish you the best of luck :D [16:09:23] 06SRE, 10MW-on-K8s, 06ServiceOps new, 10ServiceOps-SharedInfra: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11545809 (10Scott_French) Since this is fundamentally the same class of failure mode as already tracked and reported in T390251, I... [16:09:29] <3 [16:09:37] (03CR) 10Ladsgroup: [C:03+2] TimedMediaThumbnail: Set physical width and height [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230378 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [16:09:53] 06SRE, 10MW-on-K8s, 06ServiceOps new, 10ServiceOps-SharedInfra: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11545811 (10Scott_French) →14Duplicate dup:03T390251 [16:15:14] (03CR) 10Majavah: [C:03+1] dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1230351 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:17:14] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2354.codfw.wmnet with OS bookworm [16:17:32] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11545826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS book... 
[16:18:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2356.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:22:04] (03Merged) 10jenkins-bot: TimedMediaThumbnail: Set physical width and height [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230378 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [16:25:25] jhancock@cumin1003 provision (PID 3101921) is awaiting input [16:27:38] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1230378|TimedMediaThumbnail: Set physical width and height (T402792)]] [16:27:43] T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792 [16:29:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [16:29:42] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1230378|TimedMediaThumbnail: Set physical width and height (T402792)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:32:26] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:33:40] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [16:34:07] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2354.codfw.wmnet with OS bookworm [16:34:10] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7942/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215329 (owner: 10Dzahn) [16:34:24] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11545933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS bookworm... [16:36:42] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230378|TimedMediaThumbnail: Set physical width and height (T402792)]] (duration: 09m 04s) [16:36:47] T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792 [16:40:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2354.codfw.wmnet with OS bookworm [16:40:29] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11545954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS book... 
[16:52:52] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [16:55:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11545975 (10MoritzMuehlenhoff) [16:59:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [17:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1700) [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11546016 (10Jclark-ctr) a:05herron→03Jclark-ctr [17:02:28] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet2006-dev is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:02:34] (03CR) 10BCornwall: [V:03+1 C:03+1] varnish: remove ancient Noise rule from text-frontend VCL [puppet] - 10https://gerrit.wikimedia.org/r/1215329 (owner: 10Dzahn) [17:04:25] FIRING: [3x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:38] (03PS2) 10Majavah: utils/pcc: Add bash completion script [puppet] - 10https://gerrit.wikimedia.org/r/1230341 [17:11:49] (03CR) 10Majavah: utils/pcc: Add bash completion script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1230341 (owner: 10Majavah) [17:11:56] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [17:14:24] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 
0%, RTA = 30.28 ms [17:17:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2356.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:10] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:17:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:17:31] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2354.codfw.wmnet with OS bookworm [17:17:48] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546115 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2354.codfw.wmnet with OS bookworm... [17:20:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2355.codfw.wmnet with OS bookworm [17:20:29] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546150 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2355.codfw.wmnet with OS book... 
[17:21:28] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:21:28] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:21:31] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11546154 (10RKemper) [17:21:34] (03PS1) 10Ryan Kemper: wdqs: make avail SLOs dc & svc agnostic [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) [17:22:06] (03CR) 10CI reject: [V:04-1] wdqs: make avail SLOs dc & svc agnostic [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [17:24:07] (03PS2) 10Ryan Kemper: wdqs: make avail SLOs dc & svc agnostic [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) [17:27:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:30:29] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:33:46] (03CR) 10Ryan Kemper: [C:03+2] wdqs: provide trueg root access [puppet] - 10https://gerrit.wikimedia.org/r/1227862 (https://phabricator.wikimedia.org/T414517) (owner: 10Ryan Kemper) [17:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:34:55] (03CR) 10Ssingh: [V:03+1] "With the Summit next week, plan is to deploy this on Mon Feb 2 and then change the glue records sometime during the same week." [puppet] - 10https://gerrit.wikimedia.org/r/1230351 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [17:35:22] 06SRE, 06ServiceOps new, 10TimedMediaHandler-Transcode: Increase capacity for Mercurius webvideoTranscode job (1080p) processing - https://phabricator.wikimedia.org/T414427#11546274 (10MLechvien-WMF) p:05Medium→03High [17:37:18] 06SRE, 06ServiceOps new, 10TimedMediaHandler-Transcode: Increase capacity for Mercurius webvideoTranscode job (1080p) processing - https://phabricator.wikimedia.org/T414427#11546298 (10MLechvien-WMF) a:03Raine Assigning to Raine to take a look [17:44:05] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage [17:49:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage [17:57:43] 10SRE-swift-storage, 06Data-Persistence, 10Prod-Kubernetes, 06ServiceOps new, and 4 others: Fix thumbor discovery records and make swift use them - https://phabricator.wikimedia.org/T397618#11546382 (10Clement_Goubert) Tagging #data-persistence so we can get opinions on if/how we actually *can* make swift... [18:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1800) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1800) [18:00:17] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2356.codfw.wmnet with OS bookworm [18:00:34] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546397 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2356.codfw.wmnet with OS book... [18:02:24] Nothing to ship in my window this week. [18:07:00] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:09:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:09:17] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2355.codfw.wmnet with OS bookworm [18:09:37] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2355.codfw.wmnet with OS bookworm... 
[18:11:36] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2356.codfw.wmnet with reason: host reimage [18:13:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [18:13:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1251.eqiad.wmnet with reason: Maintenance [18:13:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87863 and previous config saved to /var/cache/conftool/dbconfig/20260122-181347-marostegui.json [18:13:55] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [18:13:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [18:18:25] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2356.codfw.wmnet with reason: host reimage [18:36:16] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:37:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:38:06] !ack [18:38:07] 7360 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [18:39:22] jhancock@cumin1003 reimage (PID 3116575) is awaiting input [18:42:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:50:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11546581 (10cmooney) >>! In T412525#11528293, @Jclark-ctr wrote: > @cmooney i have disconnected all the switches @Jclark-ctr I'm ha... [18:51:24] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11546599 (10Jclark-ctr) a:03Jclark-ctr [18:59:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [18:59:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2356.codfw.wmnet with OS bookworm [18:59:30] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546614 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2356.codfw.wmnet with OS bookworm... [19:00:04] andre and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T1900). [19:00:27] jouncebot: Another stop? Hell no! [19:04:30] 10ops-codfw, 06SRE, 06DC-Ops: wikikube-worker2346 DOA - https://phabricator.wikimedia.org/T414708#11546622 (10Jhancock.wm) still not even a troubleshooting email from supermicro. 
i replied to the latest "we got your message" email to see if i can get their attention. [19:05:45] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546623 (10Jhancock.wm) [19:06:42] (03CR) 10Xcollazo: "Dusting off this patchset" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [19:08:13] (03PS11) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [19:10:49] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11546627 (10Jhancock.wm) @Clement_Goubert all of the servers except wikikube-worker2346 are installed and ready for you. that one was essentially dead out o... [19:12:55] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [19:13:53] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [19:18:30] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [19:19:03] PROBLEM - Juniper alarms on asw2-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.26 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:23:09] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:23:38] 10ops-eqiad, 
06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11546641 (10VRiley-WMF) Hey @cmooney, I was able to check C2 and confirm there is a cable from the management port on the switch to th... [19:24:25] FIRING: [5x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11546643 (10VRiley-WMF) Please disregard, I thought we were talking about lsw1-c2 [19:28:57] (03PS4) 10Jasmine: charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 [19:29:53] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11546647 (10Jclark-ctr) [19:32:45] (03CR) 10Scott French: [C:03+1] charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 (owner: 10Jasmine) [19:32:54] (03CR) 10RLazarus: [C:03+1] charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 (owner: 10Jasmine) [19:33:53] (03CR) 10Jasmine: [C:03+2] charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 (owner: 10Jasmine) [19:35:07] (03Merged) 10jenkins-bot: charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 (owner: 10Jasmine) [19:36:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11546665 (10Reedy) [19:59:13] FIRING: KubernetesCalicoDown: 
ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:13:57] (03PS6) 10Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 [20:26:05] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11546729 (10Johannnes89) 05Resolved→03In progress Thanks for working on this task! I'm not yet part of the NDA group (https://ldap.toolforge.org/group/nda) as requested which... [20:29:25] FIRING: [7x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:10] (03PS1) 10Jforrester: mcrouter: Allow configuring secondary replicated caches [puppet] - 10https://gerrit.wikimedia.org/r/1229229 (https://phabricator.wikimedia.org/T411807) [20:31:33] (03CR) 10Jforrester: "I think this should be fine to land; it orders the sets differently, but otherwise is a no-op?" [puppet] - 10https://gerrit.wikimedia.org/r/1229229 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [20:33:25] (03PS12) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [20:36:16] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [20:36:36] (03CR) 10RLazarus: [C:03+1] helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 (owner: 10Jasmine) [20:36:48] (03CR) 10Scott French: [C:03+1] helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 (owner: 10Jasmine) [20:37:33] (03CR) 10Jasmine: [C:03+2] helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 (owner: 10Jasmine) [20:39:20] (03Merged) 10jenkins-bot: helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 (owner: 10Jasmine) [20:42:13] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [20:42:50] (03PS1) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [20:43:00] (03CR) 10CI reject: [V:04-1] redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [20:43:40] (03PS2) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [20:44:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:44:59] (03CR) 10Xcollazo: "Ok this patchset is ready for re-review @btullis@wikimedia.org and @joal@wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [20:46:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11546770 (10cmooney) Thanks for the help with this one guys! All the switches have been reset to factory defaults. 
So they can be re... [20:52:05] (03CR) 10RLazarus: [C:03+1] deploy: Add sophroid kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1230226 (owner: 10Jasmine) [20:52:51] (03CR) 10RLazarus: [C:03+2] deploy: Add sophroid kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1230226 (owner: 10Jasmine) [20:55:33] (03CR) 10Scott French: [C:03+1] aux-k8s: add sophroid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230108 (owner: 10Jasmine) [20:56:11] (03CR) 10RLazarus: [C:03+1] aux-k8s: add sophroid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230108 (owner: 10Jasmine) [20:57:44] (03CR) 10Jasmine: [C:03+2] aux-k8s: add sophroid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230108 (owner: 10Jasmine) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T2100). [21:00:05] JSherman and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] o/ [21:00:14] o/ [21:01:59] I can self deploy if needed [21:03:08] I suppose I'll get this thing rolling [21:03:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: 10Kgraessle) [21:03:55] Yep! 
And then, if you can, I've 3 patches to be deployed (even together) [21:04:23] Superpes: happy to do so; yes, I'll just roll them up [21:04:38] (03Merged) 10jenkins-bot: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: 10Kgraessle) [21:04:57] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1224814|When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces (T404200)]] [21:04:58] Many thanks! I'm creating the last patch :)  Just 2-3 minutes left... So you can continue with your self-deploy! [21:05:02] T404200: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces - https://phabricator.wikimedia.org/T404200 [21:05:34] (03Merged) 10jenkins-bot: aux-k8s: add sophroid namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230108 (owner: 10Jasmine) [21:06:54] !log jsn@deploy2002 kgraessle, jsn: Backport for [[gerrit:1224814|When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces (T404200)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:07:19] testing [21:09:03] (03PS1) 10Superpes15: [tgwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230449 (https://phabricator.wikimedia.org/T415307) [21:10:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T410589)', diff saved to https://phabricator.wikimedia.org/P87864 and previous config saved to /var/cache/conftool/dbconfig/20260122-211054-ladsgroup.json [21:11:00] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:13:49] !log jsn@deploy2002 kgraessle, jsn: Continuing with sync [21:18:02] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224814|When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces (T404200)]] (duration: 13m 05s) [21:18:07] T404200: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces - https://phabricator.wikimedia.org/T404200 [21:19:35] Superpes: how are things looking? [21:19:46] I'm all done [21:19:49] I'm ready :) [21:19:56] mmk [21:20:02] I scheduled my 3 patches [21:20:31] mmk, giving them a quick look [21:21:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P87865 and previous config saved to /var/cache/conftool/dbconfig/20260122-212102-ladsgroup.json [21:23:04] Superpes: looks reasonably straightforward. I'll ping you when it's time to test! 
[21:23:35] Yep I'm here :) [21:23:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229130 (https://phabricator.wikimedia.org/T414320) (owner: 10Superpes15) [21:23:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229132 (https://phabricator.wikimedia.org/T414736) (owner: 10Superpes15) [21:23:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230449 (https://phabricator.wikimedia.org/T415307) (owner: 10Superpes15) [21:25:11] (03Merged) 10jenkins-bot: [itwiki] Change tagline for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229130 (https://phabricator.wikimedia.org/T414320) (owner: 10Superpes15) [21:25:23] (03Merged) 10jenkins-bot: [hawiki] Add a temporary wordmark and tagline for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229132 (https://phabricator.wikimedia.org/T414736) (owner: 10Superpes15) [21:25:26] (03Merged) 10jenkins-bot: [tgwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230449 (https://phabricator.wikimedia.org/T415307) (owner: 10Superpes15) [21:25:48] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1229130|[itwiki] Change tagline for Wikipedia 25 (T414320)]], [[gerrit:1229132|[hawiki] Add a temporary wordmark and tagline for Wikipedia 25 (T414736)]], [[gerrit:1230449|[tgwiki] Add a temporary logo for Wikipedia 25 (T415307)]] [21:25:56] T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320 [21:25:57] T414736: Requesting temporary logo change for ha.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414736 [21:25:57] T415307: Requesting temporary logo change for tg.wikipedia.org - 
https://phabricator.wikimedia.org/T415307 [21:27:52] !log jsn@deploy2002 superpes, jsn: Backport for [[gerrit:1229130|[itwiki] Change tagline for Wikipedia 25 (T414320)]], [[gerrit:1229132|[hawiki] Add a temporary wordmark and tagline for Wikipedia 25 (T414736)]], [[gerrit:1230449|[tgwiki] Add a temporary logo for Wikipedia 25 (T415307)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:27:59] Testing (this will require a couple of minutes, since there are 3 different wikis and different skins) :) [21:28:38] nobody else in the queue, so don't sweat it [21:29:09] JSherman Everything looks fine! Many thanks :) [21:31:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P87866 and previous config saved to /var/cache/conftool/dbconfig/20260122-213110-ladsgroup.json [21:33:55] !log jsn@deploy2002 superpes, jsn: Continuing with sync [21:34:08] Superpes: no problem! 
[21:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:38:05] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229130|[itwiki] Change tagline for Wikipedia 25 (T414320)]], [[gerrit:1229132|[hawiki] Add a temporary wordmark and tagline for Wikipedia 25 (T414736)]], [[gerrit:1230449|[tgwiki] Add a temporary logo for Wikipedia 25 (T415307)]] (duration: 12m 18s) [21:38:15] T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320 [21:38:15] T414736: Requesting temporary logo change for ha.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414736 [21:38:16] T415307: Requesting temporary logo change for tg.wikipedia.org - https://phabricator.wikimedia.org/T415307 [21:38:20] Many thanks for your assistance JSherman [21:38:21] :3 [21:38:30] no problem :) [21:38:42] I think we can call this backport window done [21:41:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T410589)', diff saved to https://phabricator.wikimedia.org/P87867 and previous config saved to /var/cache/conftool/dbconfig/20260122-214119-ladsgroup.json [21:41:25] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:41:35] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2226.codfw.wmnet with reason: Maintenance [21:41:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T410589)', diff saved to https://phabricator.wikimedia.org/P87868 and previous config saved to /var/cache/conftool/dbconfig/20260122-214143-ladsgroup.json [21:45:12] (03PS1) 10DLynch: Revert "Toggler: Update heading toggler to match WAI ARIA pattern" 
[extensions/MobileFrontend] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230454 (https://phabricator.wikimedia.org/T415303) [21:52:16] ^ I'm going to need to do a backport of that once tests and approvals of the main patch finish. It might even be technically still within this backport window! But if not, it'll be in web's window, and it's reverting their thing anyway, so that should still be quite appropriate. [21:53:22] (03PS2) 10Superpes15: [u4cwiki] Add signature button to edit toolbar in Case namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137428 (https://phabricator.wikimedia.org/T392286) [21:57:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11546924 (10jhathaway) >>! In T415189#11544940, @MatthewVernon wrote: > @jhathaway I did ms-be2077 today, and see the same failur... [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T2200) [22:04:55] (03PS1) 10Gergő Tisza: WikimediaCustomizations: Set WMCBadEmailDomainsFile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230462 (https://phabricator.wikimedia.org/T397244) [22:06:00] Just wanted to confirm the Web Team deployment window was open rn. There are a couple of security patches the sec.team would like to deploy. 
[22:06:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230454 (https://phabricator.wikimedia.org/T415303) (owner: 10DLynch) [22:08:03] (03Merged) 10jenkins-bot: Revert "Toggler: Update heading toggler to match WAI ARIA pattern" [extensions/MobileFrontend] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230454 (https://phabricator.wikimedia.org/T415303) (owner: 10DLynch) [22:08:24] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1230454|Revert "Toggler: Update heading toggler to match WAI ARIA pattern" (T415303 T407908)]] [22:08:31] T415303: DiscussionTools not working on mobile - https://phabricator.wikimedia.org/T415303 [22:08:32] T407908: Re-enable browsing by headings via rotor on VoiceOver (headings not found) - https://phabricator.wikimedia.org/T407908 [22:10:23] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1230454|Revert "Toggler: Update heading toggler to match WAI ARIA pattern" (T415303 T407908)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:10:25] (03CR) 10JHathaway: [C:03+1] nftables: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219877 (owner: 10Muehlenhoff) [22:11:00] !log kemayo@deploy2002 kemayo: Continuing with sync [22:11:35] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster role from puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [22:11:52] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster2001 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230332 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [22:12:03] (03CR) 10Bking: [C:03+1] wdqs: make avail SLOs dc & svc agnostic [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:12:22] (03CR) 10Ryan Kemper: [C:03+2] wdqs: make avail SLOs dc & svc agnostic [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:12:51] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/failoid-ng: apply [22:14:59] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/failoid-ng: apply [22:15:07] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230454|Revert "Toggler: Update heading toggler to match WAI ARIA pattern" (T415303 T407908)]] (duration: 06m 43s) [22:15:13] T415303: DiscussionTools not working on mobile - https://phabricator.wikimedia.org/T415303 [22:15:14] T407908: Re-enable browsing by headings via rotor on VoiceOver (headings not found) - https://phabricator.wikimedia.org/T407908 [22:19:47] (03PS1) 10Kamila Součková: failoid-ng: don't log deploys to SAL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230471 [22:32:25] (03PS2) 10Kamila Součková: failoid-ng: decrease namespace resource quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230471 [22:40:17] !log Deployed security fix for T411305 
[22:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:57] !log Deployed security fix for T406088 [22:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:50] jouncebot: nowandnext [22:46:50] For the next 0 hour(s) and 13 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260122T2200) [22:46:51] In 8 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260123T0700) [22:48:09] is it ok if I backport some things? related to https://phabricator.wikimedia.org/T415309 [22:49:35] I was also looking to backport now [22:49:59] (03PS1) 10Clare Ming: Remove problematic logging for now [extensions/MetricsPlatform] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230487 (https://phabricator.wikimedia.org/T415309) [22:50:06] (03PS1) 10Dreamy Jazz: CheckUser: Read new for user agent table migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230488 (https://phabricator.wikimedia.org/T361199) [22:50:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230488 (https://phabricator.wikimedia.org/T361199) (owner: 10Dreamy Jazz) [22:50:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:51:20] Do you want to backport at the same time as my config patch? [22:51:20] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. 
[22:51:30] !ack [22:51:31] 7361 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [22:51:36] (03Merged) 10jenkins-bot: CheckUser: Read new for user agent table migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230488 (https://phabricator.wikimedia.org/T361199) (owner: 10Dreamy Jazz) [22:51:54] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1230488|CheckUser: Read new for user agent table migration on group0 (T361199)]] [22:51:59] T361199: Set user agent schema migration config to read new on WMF wikis - https://phabricator.wikimedia.org/T361199 [22:52:09] Dreamy_Jazz: np - i can wait until you're done -- lmk and I'll backport some fixes [22:52:15] Sure [22:52:22] Will ping when done [22:52:25] ty [22:53:58] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1230488|CheckUser: Read new for user agent table migration on group0 (T361199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:54:53] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [22:55:24] !log jasmine@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [22:55:57] Testing... [22:56:15] !log jasmine@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [22:57:19] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [23:00:55] cjming: You should be able to +2 your change now, as my sync will be done in a few mins [23:01:11] great - tysm! 
[23:01:33] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230488|CheckUser: Read new for user agent table migration on group0 (T361199)]] (duration: 09m 39s)
[23:01:38] T361199: Set user agent schema migration config to read new on WMF wikis - https://phabricator.wikimedia.org/T361199
[23:01:43] I'm now fully done with my config patch
[23:02:36] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply
[23:02:37] (PS1) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309)
[23:03:22] (CR) Clare Ming: "recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:03:30] (CR) Clare Ming: "recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:04:26] (CR) CI reject: [V:-1] Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:05:11] cool - i'm going to backport 2 things - just waiting for CI
[23:06:21] (Abandoned) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:07:51] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230487 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:09:01] (Restored) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:09:18] (Merged) jenkins-bot: Remove problematic logging for now [extensions/MetricsPlatform] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230487 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:09:40] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1230487|Remove problematic logging for now (T415309)]]
[23:09:45] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309
[23:10:25] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply
[23:10:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:11:39] !log cjming@deploy2002 cjming: Backport for [[gerrit:1230487|Remove problematic logging for now (T415309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:12:02] !log cjming@deploy2002 cjming: Continuing with sync
[23:14:44] (PS3) Kamila Součková: failoid-ng: start breaking it [deployment-charts] - https://gerrit.wikimedia.org/r/1230471
[23:15:31] (CR) Clare Ming: "recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:15:42] (PS1) Seawolf35gerrit: rowiki: Set noindex for User: and User talk: Namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1230498 (https://phabricator.wikimedia.org/T414992)
[23:16:09] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230487|Remove problematic logging for now (T415309)]] (duration: 06m 29s)
[23:16:15] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309
[23:18:13] (Abandoned) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:19:34] need to do one more - not sure what happened in my haste but i think i selected cherry pick topic and now the backport for Test Kitchen is all screwy
[23:19:59] (Restored) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:20:46] (PS1) Jasmine: sophroid: add command line flags to sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1230500
[23:22:40] ugh - i'm making a mess - is anyone around who can help me unravel https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1230486?
[23:23:33] oh wait - maybe i have it
[23:23:54] (CR) RLazarus: [C:+1] sophroid: add command line flags to sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1230500 (owner: Jasmine)
[23:23:55] (CR) Scott French: [C:+1] sophroid: add command line flags to sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1230500 (owner: Jasmine)
[23:24:11] (CR) Jasmine: [C:+2] sophroid: add command line flags to sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1230500 (owner: Jasmine)
[23:25:51] (PS2) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309)
[23:26:06] (Merged) jenkins-bot: sophroid: add command line flags to sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1230500 (owner: Jasmine)
[23:26:51] (CR) CI reject: [V:-1] Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:27:24] (PS3) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309)
[23:27:43] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply
[23:27:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:28:13] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply
[23:28:27] !ack
[23:28:28] 7362 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[23:28:57] will finish up here in a few minutes
[23:29:09] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply
[23:33:23] (CR) CI reject: [V:-1] Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:33:44] (PS4) Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309)
[23:36:09] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply
[23:39:40] (PS1) Jasmine: sophroid: add port command line flag [deployment-charts] - https://gerrit.wikimedia.org/r/1230502
[23:40:44] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:41:42] (Merged) jenkins-bot: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1230486 (https://phabricator.wikimedia.org/T415309) (owner: Clare Ming)
[23:41:59] (CR) Scott French: [C:+1] sophroid: add port command line flag [deployment-charts] - https://gerrit.wikimedia.org/r/1230502 (owner: Jasmine)
[23:42:03] (CR) RLazarus: [C:+1] sophroid: add port command line flag [deployment-charts] - https://gerrit.wikimedia.org/r/1230502 (owner: Jasmine)
[23:42:05] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1230486|Remove problematic logging for now (T415309)]]
[23:42:09] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309
[23:42:31] (CR) Jasmine: [C:+2] sophroid: add port command line flag [deployment-charts] - https://gerrit.wikimedia.org/r/1230502 (owner: Jasmine)
[23:44:02] !log cjming@deploy2002 cjming: Backport for [[gerrit:1230486|Remove problematic logging for now (T415309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:44:21] (Merged) jenkins-bot: sophroid: add port command line flag [deployment-charts] - https://gerrit.wikimedia.org/r/1230502 (owner: Jasmine)
[23:44:22] !log cjming@deploy2002 cjming: Continuing with sync
[23:45:28] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply
[23:48:32] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230486|Remove problematic logging for now (T415309)]] (duration: 06m 27s)
[23:48:36] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309
[23:55:03] (PS1) Jasmine: sophroid: remove readiness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1230510
[23:55:57] (PS1) Ryan Kemper: opensearch-semantic-search: enable ceph [deployment-charts] - https://gerrit.wikimedia.org/r/1230511 (https://phabricator.wikimedia.org/T414702)
[23:55:59] (PS1) Ryan Kemper: opensearch-semantic-search: provision namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1230512 (https://phabricator.wikimedia.org/T414702)
[23:56:07] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply
[23:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown