[00:00:06] (03PS2) 10Ryan Kemper: opensearch-semantic-search: provision namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230512 (https://phabricator.wikimedia.org/T414702) [00:01:00] !log sukhe@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe2009*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [00:01:09] (03PS2) 10Jasmine: sophroid: remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230510 [00:01:29] !log sukhe@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe2009*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [00:02:18] (03CR) 10Scott French: [C:03+1] sophroid: remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230510 (owner: 10Jasmine) [00:03:32] (03CR) 10RLazarus: [C:03+1] sophroid: remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230510 (owner: 10Jasmine) [00:03:36] (03CR) 10Jasmine: [C:03+2] sophroid: remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230510 (owner: 10Jasmine) [00:05:34] (03Merged) 10jenkins-bot: sophroid: remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230510 (owner: 10Jasmine) [00:06:07] (03PS3) 10Ryan Kemper: opensearch-semantic-search: provision namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230512 (https://phabricator.wikimedia.org/T414702) [00:08:10] sorry - i need to follow up with a few more backports 😬 [00:12:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [00:13:33] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply [00:14:23] !log jasmine@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply [00:19:49] !log jasmine@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/sophroid: apply [00:20:17] !log jasmine@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/sophroid: apply [00:20:33] (03PS1) 10Zabe: Start reading from il_target_id from s5 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230533 (https://phabricator.wikimedia.org/T413669) [00:20:35] i missed a few more messages that are spamming the console - it should be quick [00:29:40] FIRING: [7x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:31:09] (03PS1) 10Clare Ming: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230537 (https://phabricator.wikimedia.org/T415309) [00:38:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230537 (https://phabricator.wikimedia.org/T415309) (owner: 10Clare Ming) [00:39:54] (03Merged) 10jenkins-bot: Remove problematic logging for now [extensions/TestKitchen] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230537 (https://phabricator.wikimedia.org/T415309) (owner: 10Clare Ming) [00:40:13] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1230537|Remove problematic logging for now (T415309)]] [00:40:18] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - 
https://phabricator.wikimedia.org/T415309 [00:40:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1230538 [00:40:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1230538 (owner: 10TrainBranchBot) [00:41:59] (03PS1) 10Clare Ming: Remove problematic logging for now [extensions/MetricsPlatform] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230539 (https://phabricator.wikimedia.org/T415309) [00:42:18] !log cjming@deploy2002 cjming: Backport for [[gerrit:1230537|Remove problematic logging for now (T415309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:43:12] !log cjming@deploy2002 cjming: Continuing with sync [00:47:25] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230537|Remove problematic logging for now (T415309)]] (duration: 07m 12s) [00:47:31] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309 [00:48:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230539 (https://phabricator.wikimedia.org/T415309) (owner: 10Clare Ming) [00:53:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1230538 (owner: 10TrainBranchBot) [00:53:52] (03Merged) 10jenkins-bot: Remove problematic logging for now [extensions/MetricsPlatform] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1230539 (https://phabricator.wikimedia.org/T415309) (owner: 10Clare Ming) [00:54:32] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1230539|Remove problematic logging for now (T415309)]] [00:54:37] T415309: 
Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309 [00:56:29] !log cjming@deploy2002 cjming: Backport for [[gerrit:1230539|Remove problematic logging for now (T415309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:56:46] (03PS1) 10RLazarus: sophroid: Re-insert readiness probe, as a gRPC probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230544 [00:56:55] !log cjming@deploy2002 cjming: Continuing with sync [01:00:59] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230539|Remove problematic logging for now (T415309)]] (duration: 06m 27s) [01:01:04] T415309: Test kitchen producing errors in javascript console on every Wikipedia page - https://phabricator.wikimedia.org/T415309 [01:01:34] ok done [01:10:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1230545 [01:10:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1230545 (owner: 10TrainBranchBot) [01:24:18] jouncebot: nowandnext [01:24:18] No deployments scheduled for the next 5 hour(s) and 35 minute(s) [01:24:18] In 5 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260123T0700) [01:24:25] (03CR) 10Zabe: [C:03+2] Start reading from il_target_id from s5 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230533 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:25:13] (03Merged) 10jenkins-bot: Start reading from il_target_id from s5 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230533 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:25:35] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1230533|Start reading from il_target_id from s5 and s8 wikis 
(T413669)]] [01:27:48] !log zabe@deploy2002 zabe: Backport for [[gerrit:1230533|Start reading from il_target_id from s5 and s8 wikis (T413669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:27:54] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669 [01:28:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:28:47] !log zabe@deploy2002 zabe: Continuing with sync [01:30:42] (03PS1) 10Ladsgroup: kerberos: Add a space after period in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1230547 [01:32:51] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230533|Start reading from il_target_id from s5 and s8 wikis (T413669)]] (duration: 07m 16s) [01:32:56] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669 [01:33:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:33:44] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1230545 (owner: 10TrainBranchBot) [01:34:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87869 and previous config saved to /var/cache/conftool/dbconfig/20260123-013402-marostegui.json [01:34:10] T411163: Drop ar_sha1 from archive table in wmf production - 
https://phabricator.wikimedia.org/T411163 [01:34:11] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [01:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:44:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P87870 and previous config saved to /var/cache/conftool/dbconfig/20260123-014411-marostegui.json [01:54:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P87871 and previous config saved to /var/cache/conftool/dbconfig/20260123-015419-marostegui.json [02:04:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87872 and previous config saved to 
/var/cache/conftool/dbconfig/20260123-020427-marostegui.json [02:04:35] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:04:36] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:04:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [02:04:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87873 and previous config saved to /var/cache/conftool/dbconfig/20260123-020453-marostegui.json [02:19:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87874 and previous config saved to /var/cache/conftool/dbconfig/20260123-021940-marostegui.json [02:19:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:19:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:29:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P87875 and previous config saved to /var/cache/conftool/dbconfig/20260123-022948-marostegui.json [02:39:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P87876 and previous config saved to /var/cache/conftool/dbconfig/20260123-023957-marostegui.json [02:42:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:43:11] sigh
[02:43:12] !ack
[02:43:13] 7363 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[02:46:45] (CR) Scott French: [C:+1] "Thanks, Reuven! This looks right in terms of implementing the k8s side of things." [deployment-charts] - https://gerrit.wikimedia.org/r/1230544 (owner: RLazarus)
[02:47:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:47:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[02:48:02] !ack
[02:48:03] no value provided for parameter incident and no default available
[02:48:03] All incidents are already acked.
[02:50:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87877 and previous config saved to /var/cache/conftool/dbconfig/20260123-025005-marostegui.json
[02:50:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[02:50:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:50:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:50:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87878 and previous config saved to /var/cache/conftool/dbconfig/20260123-025018-marostegui.json
[02:57:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:57:54] !ack
[02:57:54] no value provided for parameter incident and no default available
[02:57:54] All incidents are already acked.
[02:58:32] !incidents
[02:58:32] 7363 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[02:58:33] 7362 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[02:58:33] 7361 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[02:58:33] 7360 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[02:58:33] 7358 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out)
[03:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:32:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:29:40] FIRING: [7x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:10:03] SRE-SLO, observability, Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work: Update WDQS SLOs to reflect graph split changes - 
https://phabricator.wikimedia.org/T393966#11547679 (10RKemper) Merged patch for the new SLO (and corresponding recording rules; I realized pyrra wants stuff in term... [05:18:01] (03PS1) 10Ryan Kemper: WDQS: separate avail SLOs per service [puppet] - 10https://gerrit.wikimedia.org/r/1230672 (https://phabricator.wikimedia.org/T393966) [05:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:23] (03PS1) 10Marostegui: dbproxy2008: Migration to Debian Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1230688 (https://phabricator.wikimedia.org/T414656) [05:42:00] (03CR) 10Marostegui: [C:03+2] dbproxy2008: Migration to Debian Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1230688 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [05:42:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy2008.codfw.wmnet with OS trixie [05:57:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1009.eqiad.wmnet with reason: long schema change [05:57:59] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2008.codfw.wmnet with reason: host reimage [05:59:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T410589)', diff saved to https://phabricator.wikimedia.org/P87879 and previous config saved to /var/cache/conftool/dbconfig/20260123-055859-ladsgroup.json [05:59:06] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:03:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2008.codfw.wmnet with reason: host reimage [06:09:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P87880 and previous config saved to /var/cache/conftool/dbconfig/20260123-060908-ladsgroup.json [06:17:58] (03PS1) 10Stang: zhwiki: Remove extra autoconfirmed limit for Tor user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) [06:18:45] (03CR) 10CI reject: [V:04-1] zhwiki: Remove extra autoconfirmed limit for Tor user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: 10Stang) [06:19:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P87881 and previous config saved to /var/cache/conftool/dbconfig/20260123-061915-ladsgroup.json [06:20:03] (03PS2) 10Stang: zhwiki: Remove extra autoconfirmed limit for Tor user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) [06:20:49] (03CR) 10CI reject: [V:04-1] zhwiki: Remove extra autoconfirmed limit for Tor user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: 10Stang) [06:21:44] (03PS3) 10Stang: zhwiki: Remove extra autoconfirmed limit for Tor user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) [06:26:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2008.codfw.wmnet with OS trixie [06:29:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T410589)', diff saved to https://phabricator.wikimedia.org/P87882 and previous config saved to /var/cache/conftool/dbconfig/20260123-062924-ladsgroup.json [06:29:30] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:29:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 3 days, 0:00:00 on db2238.codfw.wmnet with reason: Maintenance [06:29:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T410589)', diff saved to https://phabricator.wikimedia.org/P87883 and previous config saved to /var/cache/conftool/dbconfig/20260123-062948-ladsgroup.json [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260123T0700) [07:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:42:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87884 and previous config saved to /var/cache/conftool/dbconfig/20260123-074215-marostegui.json [07:42:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:42:24] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:46:14] (03PS1) 10Muehlenhoff: Record LDAP access for jerrywang [puppet] - 10https://gerrit.wikimedia.org/r/1230763 [07:50:03] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for jerrywang [puppet] - 10https://gerrit.wikimedia.org/r/1230763 (owner: 10Muehlenhoff) [07:52:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P87885 and previous config saved to /var/cache/conftool/dbconfig/20260123-075223-marostegui.json [07:59:13] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260123T0800) [08:02:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P87886 and previous config saved to /var/cache/conftool/dbconfig/20260123-080232-marostegui.json [08:12:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87887 and previous config saved to /var/cache/conftool/dbconfig/20260123-081240-marostegui.json [08:12:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [08:12:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:12:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:29:40] FIRING: [7x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:32:49] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11547878 (10Clement_Goubert) >>! In T408757#11546627, @Jhancock.wm wrote: > @Clement_Goubert all of the servers except wikikube-worker2346 are installed and... 
[08:32:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:33:06] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2346 DOA - https://phabricator.wikimedia.org/T414708#11547880 (10Clement_Goubert) [08:33:24] Woot [08:34:09] !incidents [08:34:09] 7364 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [08:34:09] 7363 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [08:34:10] 7362 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [08:34:10] 7361 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [08:34:10] 7360 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [08:34:10] 7358 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [08:34:14] !ack 7364 [08:34:14] 7364 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [08:34:27] marostegui: you can !ack without argument and it will ack the last one [08:34:38] claime: Ah thanks :) [08:35:51] checking too [08:37:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:58:15] (03CR) 10Elukey: "Hi folks! 
I appreciate a lot the follow up for the wdqs configs, but please reach out to somebody from the SLO working group before mergin" [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [09:01:01] (03CR) 10Elukey: "Moreover:" [puppet] - 10https://gerrit.wikimedia.org/r/1230399 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [09:07:16] !log installing Linux 6.1.159 on Bookworm hosts [09:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:06] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Requesting `analytics-admins` access for AKhatun - https://phabricator.wikimedia.org/T414846#11547917 (10Gehel) [09:08:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11547916 (10Gehel) [09:08:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11547918 (10Gehel) [09:20:09] RESOLVED: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:36:50] 06SRE, 10MW-on-K8s, 06serviceops: Migrate MW appservers' base images to bullseye - 
https://phabricator.wikimedia.org/T356293#11547964 (10MoritzMuehlenhoff) 05Stalled→03Resolved a:03MoritzMuehlenhoff This is long done [09:40:56] (03CR) 10Elukey: [C:03+2] docker_registry: simplify and improve the /v2/ comment [puppet] - 10https://gerrit.wikimedia.org/r/1229143 (owner: 10Elukey) [09:41:53] (03PS2) 10Elukey: DNM: docker_registry: move /v2/restricted to the s3 restricted backend [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) [09:44:30] (03PS3) 10Elukey: docker_registry: move /v2/restricted to the s3 restricted backend [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) [09:44:59] (03CR) 10Elukey: "To be merged after the SRE Summit." [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [09:46:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [09:48:36] (03CR) 10Elukey: ml-builder-docker: add group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1230280 (owner: 10Dpogorzelski) [09:50:53] (03CR) 10Clément Goubert: [C:03+1] failoid-ng: start breaking it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230471 (owner: 10Kamila Součková) [09:52:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [09:56:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [09:57:58] (03PS1) 10Jgiannelos: mobileapps: Define max-semi-space-size for node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230877 (https://phabricator.wikimedia.org/T410296) [09:59:01] (03CR) 10Clément Goubert: [C:03+1] mobileapps: Define max-semi-space-size for node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230877 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [10:01:18] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Define 
max-semi-space-size for node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230877 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [10:02:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11548001 (10Gehel) [10:02:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11548005 (10Gehel) [10:03:05] (03Merged) 10jenkins-bot: mobileapps: Define max-semi-space-size for node [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230877 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [10:03:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [10:03:59] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work, 13Patch-For-Review: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11548029 (10Gehel) [10:04:11] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), and 3 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11548040 (10Gehel) [10:04:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11548042 (10Gehel) [10:04:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11548046 (10Gehel) [10:04:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11548048 (10Gehel) [10:05:37] 
!log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:05:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11548076 (10Gehel) [10:05:50] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:06:03] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:06:10] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11548100 (10Gehel) [10:06:16] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11548098 (10Gehel) [10:06:47] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:07:06] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:07:49] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:12:27] 06SRE, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11548233 (10Gehel) [10:20:45] 10SRE-SLO, 10observability, 10Wikidata-Query-Service, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), and 2 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11548287 (10gmodena) [10:27:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:28:29] 06SRE, 06Infrastructure-Foundations: Build OpenGear serial port config from Netbox - https://phabricator.wikimedia.org/T415345 
(10cmooney) 03NEW p:05Triage→03Low [10:45:28] 06SRE, 06Infrastructure-Foundations: Migrate diffscan VM to Trixie - https://phabricator.wikimedia.org/T415347 (10MoritzMuehlenhoff) 03NEW [10:48:20] (03PS3) 10Slyngshede: Docker build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) [10:59:22] (03CR) 10Elukey: Docker build (034 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [11:06:03] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:27] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Update [11:09:13] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:31] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [11:14:13] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:18:06] (03CR) 10Clément Goubert: [C:03+2] failoid-ng: start breaking it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230471 (owner: 10Kamila Součková) [11:18:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 
fe80::a6e1:1a00:106f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:19:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11548425 (10Arnoldokoth) 05Open→03In progress a:03Arnoldokoth [11:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:20:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11548431 (10Arnoldokoth) [11:21:14] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11548435 (10Arnoldokoth) @Ottomata Kindly approve. 
[11:23:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::a6e1:1a00:106f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[11:24:14] SRE, Infrastructure-Foundations: Build OpenGear serial port config from Netbox - https://phabricator.wikimedia.org/T415345#11548437 (cmooney) Despite the fact I should be spending time on other things I had a bash at this: https://github.com/topranks/openconfigports
[11:24:25] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:37] (Merged) jenkins-bot: failoid-ng: start breaking it [deployment-charts] - https://gerrit.wikimedia.org/r/1230471 (owner: Kamila Součková)
[11:43:58] (CR) Silvan Heintze: [C:+1] "Yes, sounds reasonable. Thanks." [dumps] - https://gerrit.wikimedia.org/r/1229127 (https://phabricator.wikimedia.org/T408423) (owner: Jakob)
[12:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260123T0800)
[12:00:04] jelto, arnoldokoth, mutante, and arnaudb: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260123T1200).
[12:04:02] (CR) Elukey: Docker build (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: Slyngshede)
[12:04:54] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[12:05:46] (03PS1) 10Muehlenhoff: Stop running the IP reputation dump on the Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230912 (https://phabricator.wikimedia.org/T365798) [12:05:48] (03PS1) 10Muehlenhoff: Remove ip_reputation_vendors from Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230913 (https://phabricator.wikimedia.org/T365798) [12:06:16] (03CR) 10CI reject: [V:04-1] Stop running the IP reputation dump on the Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230912 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:07:54] aokoth@cumin1003 upgrade (PID 3249333) is awaiting input [12:10:32] (03PS2) 10Muehlenhoff: Stop running the IP reputation dump on the Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230912 (https://phabricator.wikimedia.org/T365798) [12:14:43] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:15:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1230912 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:15:43] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 117333 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:18:00] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Update [12:18:48] (03PS1) 10Muehlenhoff: Remove ip_reputation_vendors from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/1230914 (https://phabricator.wikimedia.org/T365798) [12:19:25] FIRING: [10x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:57] (03Abandoned) 10Muehlenhoff: Remove ip_reputation_vendors from Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1230913 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:24:11] (03CR) 10Pmiazga: [C:03+1] rest-gateway: add support for sessionJwt cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224173 (owner: 10Daniel Kinzler) [12:24:25] FIRING: [10x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:10] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11548597 (10cmooney) So looking at dse-k8s-worker1013 it has now been up for 1 day 18 hours, yet we st... 
[12:32:44] !log uploaded dnsmasq 2.92-1~wmf12u to bookworm-wikimedia/main T396864 [12:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] T396864: Routed Ganeti: same node DHCP limitation - https://phabricator.wikimedia.org/T396864 [12:34:29] (03PS4) 10Slyngshede: Docker build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) [12:35:37] (03CR) 10Slyngshede: Docker build (035 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [12:48:43] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1230912 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:53:59] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:55:02] (03PS1) 10Jgiannelos: mobileapps: Revert to last known working state (node18) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230917 (https://phabricator.wikimedia.org/T410296) [12:55:43] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:09:09] (03CR) 10Elukey: [C:03+1] "LTGM!" 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [13:20:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11548743 (10JAllemandou) It seems that the `dse-k8s-worker1019` still has the problem: {F71597128} [13:24:25] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:00] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:29:25] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:40] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003" [13:29:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003" [13:29:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-ctrl1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:32:53] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-ctrl1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:33:04] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for 
host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:33:16] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:33:41] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:29] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [13:34:39] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:35:17] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [13:36:15] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.27 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:39:18] jclark@cumin1003 provision (PID 3266978) is awaiting input [13:40:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-ctrl1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:40:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-ctrl1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:40:51] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:41:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:42:03] !log 
jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:42:12] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 10ServiceOps-Datastores, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#11548824 (10MLechvien-WMF) Removing our tag, please add it back if anything is needed from our end [13:44:05] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1003 [13:44:10] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:44:18] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1003 [13:44:45] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:48:35] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1001.eqiad.wmnet with OS trixie [13:48:42] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie [13:48:44] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11548843 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1001.eqiad.wmnet with OS trixie [13:48:46] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1004.eqiad.wmnet with OS trixie [13:48:48] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11548844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie [13:48:53] 10ops-eqiad, 06SRE, 06DC-Ops: 
Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11548845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1004.eqiad.wmnet with OS trixie [13:48:57] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie [13:49:12] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11548854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie [13:49:29] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:51:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1002.eqiad.wmnet with OS trixie [13:51:40] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11548860 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie [13:55:13] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:58:01] jclark@cumin1003 reimage (PID 3270130) is awaiting input [13:59:47] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1004.eqiad.wmnet with reason: host reimage [13:59:59] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1001.eqiad.wmnet with reason: host reimage [14:00:05] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-ctrl1001.eqiad.wmnet with reason: host reimage [14:00:11] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-ctrl1002.eqiad.wmnet with reason: host reimage 
[14:02:43] jclark@cumin1003 provision (PID 3270481) is awaiting input
[14:04:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1004.eqiad.wmnet with reason: host reimage
[14:07:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1001.eqiad.wmnet with reason: host reimage
[14:07:54] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:08:02] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[14:08:10] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11548998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie executed with errors: - tools-k8s-worker1002 (**F...
[14:09:50] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:10:20] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:10:37] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:10:47] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:11:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:11:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-ctrl1001.eqiad.wmnet with reason: host reimage [14:16:25] jclark@cumin1003 provision (PID 3271305) is awaiting input [14:19:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [14:19:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:19:50] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:20:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:20:52] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:21:34] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:22:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:22:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1004.eqiad.wmnet with OS trixie [14:23:06] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549045 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1004.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1004 (**PASS**) -... [14:23:54] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:24:55] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:26:28] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:26:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1001.eqiad.wmnet with OS trixie [14:26:37] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1001.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1001 (**PASS**) -... 
[14:27:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:27:42] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:28:18] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:28:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie [14:28:34] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie completed: - tools-k8s-ctrl1001 (**PASS**) - Remo... [14:35:41] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:37:50] 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413005#11549076 (10phaultfinder) [14:38:45] jclark@cumin1003 reimage (PID 3267808) is awaiting input [14:42:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:42:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie [14:43:05] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549093 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie completed: - 
tools-k8s-ctrl1002 (**PASS**) - Remo... [14:52:47] (03CR) 10Clément Goubert: [C:03+1] mobileapps: Revert to last known working state (node18) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230917 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:04:07] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11549154 (10MoritzMuehlenhoff) [15:05:24] (03PS1) 10Jdrewniak: WP25EasterEggs added to extension-list, config var, enabled on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230931 (https://phabricator.wikimedia.org/T415372) [15:16:43] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Revert to last known working state (node18) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230917 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:18:42] (03Merged) 10jenkins-bot: mobileapps: Revert to last known working state (node18) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230917 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [15:19:13] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:22:47] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:23:13] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:23:33] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:23:44] (03PS1) 10Fabfur: varnish: set Retry-After for cli_tool, wdqs and library policies [puppet] - 10https://gerrit.wikimedia.org/r/1230937 (https://phabricator.wikimedia.org/T415375) [15:24:15] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:24:31] (03CR) 10Fabfur: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1230937 (https://phabricator.wikimedia.org/T415375) (owner: 10Fabfur) [15:24:48] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:25:30] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:26:49] RECOVERY - MariaDB Replica Lag: s1 on dbstore1008 is OK: OK slave_sql_lag Replication lag: 49.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:30:57] (03PS1) 10Clément Goubert: failoid-ng: Break completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230938 [15:31:06] (03CR) 10CI reject: [V:04-1] failoid-ng: Break completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230938 (owner: 10Clément Goubert) [15:31:20] (03PS2) 10Clément Goubert: failoid-ng: Break completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230938 [15:41:35] (03PS3) 10Clément Goubert: failoid-ng: Break completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230938 [15:52:09] (03CR) 10Ahmon Dancy: "Exciting!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [15:53:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:54:03] !ack [15:54:04] 7365 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw) [15:54:10] hellooo [15:54:21] same thing as we got in the morning [15:54:40] yep: https://grafana.wikimedia.org/goto/aXOvJ8SDg?orgId=1 [15:54:46] and yesterday and tonight [15:54:52] marostegui: you clearly didn't apply the right cookbooks [15:55:02] elukey: busy with cumin! [15:55:18] we're still at a pretty low rps of errors [15:56:07] (03PS24) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [15:56:18] (03PS3) 10Daniel Kinzler: rest-gateway: add support for sessionJwt cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224173 [15:58:17] volans: godo.g mentioned this for this specific alert https://phabricator.wikimedia.org/T400675 [15:58:47] (03CR) 10Brouberol: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [15:59:51] (03PS1) 10Xcollazo: Scale down mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230950 (https://phabricator.wikimedia.org/T411803) [16:01:05] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: add support for sessionJwt cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224173 (owner: 10Daniel Kinzler) [16:02:35] 
(03CR) 10Brouberol: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230950 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [16:02:46] (03CR) 10JavierMonton: [C:03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230950 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [16:03:49] (03CR) 10A-pizzata: [C:03+1] Scale down mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230950 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [16:04:04] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11549362 (10Vgutierrez) [16:04:08] 10ops-eqsin, 06SRE: Unresponsive management for cp5022.mgmt:22 - https://phabricator.wikimedia.org/T414879#11549365 (10Vgutierrez) →14Duplicate dup:03T414411 [16:04:31] (03Merged) 10jenkins-bot: Scale down mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230950 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [16:04:43] (03CR) 10Kamila Součková: [C:03+1] api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [16:05:20] !log vgutierrez@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on cp5022.eqsin.wmnet with reason: cp5022 is unreacheable [16:05:36] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11549368 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=85b11191-0733-4a6c-a314-a87c77eb102d) set by vgutierrez@cumin1003 for 10 days, 0:00:00 on 1 host(s) and their services with... 
[16:05:37] ah I was about to do the same
[16:07:40] (CR) Pmiazga: [C:+1] rest-gateway: add support for sessionJwt cookies [deployment-charts] - https://gerrit.wikimedia.org/r/1224173 (owner: Daniel Kinzler)
[16:08:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:09:18] !ack
[16:09:18] no value provided for parameter incident and no default available
[16:09:18] All incidents are already acked.
[16:09:31] !incidents
[16:09:31] 7365 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:09:31] 7364 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:09:32] 7363 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:09:32] 7362 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:09:32] 7361 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:09:32] 7360 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:10:33] (CR) Brouberol: [C:+2] dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. [puppet] - https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: Xcollazo)
[16:13:30] (CR) Daniel Kinzler: [C:+2] api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: Pmiazga)
[16:13:33] (CR) Daniel Kinzler: [C:+2] rest-gateway: add support for sessionJwt cookies [deployment-charts] - https://gerrit.wikimedia.org/r/1224173 (owner: Daniel Kinzler)
[16:15:26] (PS1) Clément Goubert: thumbor: 100 replicas to absorb queue [deployment-charts] - https://gerrit.wikimedia.org/r/1230951
[16:15:45] (Merged) jenkins-bot: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: Pmiazga)
[16:15:47] (Merged) jenkins-bot: rest-gateway: add support for sessionJwt cookies [deployment-charts] - https://gerrit.wikimedia.org/r/1224173 (owner: Daniel Kinzler)
[16:16:20] (CR) Hnowlan: [C:+1] thumbor: 100 replicas to absorb queue [deployment-charts] - https://gerrit.wikimedia.org/r/1230951 (owner: Clément Goubert)
[16:16:26] (CR) Volans: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1230951 (owner: Clément Goubert)
[16:16:27] (PS1) Ahmon Dancy: pretrain: Run one hour later, at 02:00UTC [puppet] - https://gerrit.wikimedia.org/r/1230952 (https://phabricator.wikimedia.org/T398873)
[16:16:39] (CR) Elukey: [C:+1] thumbor: 100 replicas to absorb queue [deployment-charts] - https://gerrit.wikimedia.org/r/1230951 (owner: Clément Goubert)
[16:17:18] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[16:17:19] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1230952 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy)
[16:17:25] (CR) Clément Goubert: [C:+2] thumbor: 100 replicas to absorb queue [deployment-charts] - https://gerrit.wikimedia.org/r/1230951 (owner: Clément Goubert)
[16:18:24] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:19:24] (Merged) jenkins-bot: thumbor: 100 replicas to absorb queue [deployment-charts] - https://gerrit.wikimedia.org/r/1230951 (owner: Clément Goubert)
[16:19:37] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:20:07] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[16:20:15] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[16:20:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[16:20:53] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003"
[16:20:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003"
[16:20:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:21:08] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[16:23:00] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[16:23:39] (PS1) Muehlenhoff: Record LDAP access for lerickson [puppet] - https://gerrit.wikimedia.org/r/1230955
[16:23:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:25:22] !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:25:45] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[16:25:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:26:15] (PS1) Xcollazo: dumps: Fix MW Content File Export. Remove already absented file def. [puppet] - https://gerrit.wikimedia.org/r/1230956 (https://phabricator.wikimedia.org/T414389)
[16:26:43] (CR) Ahmon Dancy: "PCC output: https://puppet-compiler.wmflabs.org/output/1230952/5691/deploy2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1230952 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy)
[16:26:57] (CR) Muehlenhoff: [C:+2] Record LDAP access for lerickson [puppet] - https://gerrit.wikimedia.org/r/1230955 (owner: Muehlenhoff)
[16:28:13] ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11549433 (RobH) While I've contacted Jin to do this work (T415090) I'm hesitant to do so during the week of the SRE offsite. While I am attending remotely, the shift I'll have to make to attend in...
[16:28:45] (CR) Brouberol: [C:+1] dumps: Fix MW Content File Export. Remove already absented file def. [puppet] - https://gerrit.wikimedia.org/r/1230956 (https://phabricator.wikimedia.org/T414389) (owner: Xcollazo)
[16:28:48] (CR) Brouberol: [C:+2] dumps: Fix MW Content File Export. Remove already absented file def. [puppet] - https://gerrit.wikimedia.org/r/1230956 (https://phabricator.wikimedia.org/T414389) (owner: Xcollazo)
[16:30:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003"
[16:30:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003"
[16:30:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:31:00] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:31:11] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:33:21] (CR) BryanDavis: [C:+1] "Nice find." [puppet] - https://gerrit.wikimedia.org/r/1230952 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy)
[16:35:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:36:57] jclark@cumin1003 provision (PID 3294117) is awaiting input
[16:38:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:39:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:41:12] (PS1) Jdrewniak: Bumping portals submodule to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1230961 (https://phabricator.wikimedia.org/T128546)
[16:41:16] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:43:05] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[16:43:18] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549459 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[16:43:39] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:48:29] !log dancy@deploy2002 Installing scap version "4.235.0" for 2 host(s)
[16:49:16] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:50:20] !log dancy@deploy2002 Installation of scap version "4.235.0" completed for 2 hosts
[16:52:37] (PS1) Xcollazo: dumps: Update index.html file to reflect XML dumps deprecation [deployment-charts] - https://gerrit.wikimedia.org/r/1230965 (https://phabricator.wikimedia.org/T414389)
[16:53:22] SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11549524 (MoritzMuehlenhoff)
[16:53:51] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549527 (Jclark-ctr)
[16:53:59] (CR) Brouberol: [C:+1] "No more DVDs :(" [deployment-charts] - https://gerrit.wikimedia.org/r/1230965 (https://phabricator.wikimedia.org/T414389) (owner: Xcollazo)
[16:54:03] (CR) Brouberol: [C:+2] dumps: Update index.html file to reflect XML dumps deprecation [deployment-charts] - https://gerrit.wikimedia.org/r/1230965 (https://phabricator.wikimedia.org/T414389) (owner: Xcollazo)
[16:54:04] (CR) Brouberol: [V:+2 C:+2] dumps: Update index.html file to reflect XML dumps deprecation [deployment-charts] - https://gerrit.wikimedia.org/r/1230965 (https://phabricator.wikimedia.org/T414389) (owner: Xcollazo)
[16:55:04] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1003.eqiad.wmnet with reason: host reimage
[16:56:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:56:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:56:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[16:57:10] !ack
[16:57:11] 7366 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet codfw)
[16:57:36] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[16:57:44] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549541 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[16:59:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1003.eqiad.wmnet with reason: host reimage
[17:01:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[17:05:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:58] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1002.eqiad.wmnet with reason: host reimage
[17:13:12] (PS1) Kamila Součková: Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969
[17:13:17] (PS1) Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512)
[17:13:22] (CR) CI reject: [V:-1] Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969 (owner: Kamila Součková)
[17:13:35] (PS1) Kamila Součková: Revert "rest-gateway: add support for sessionJwt cookies" [deployment-charts] - https://gerrit.wikimedia.org/r/1230971
[17:14:41] (CR) CI reject: [V:-1] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[17:14:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1002.eqiad.wmnet with reason: host reimage
[17:15:26] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[17:16:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[17:16:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[17:16:24] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549599 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1003.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1003 (**PASS**) -...
[17:19:50] (CR) Kamila Součková: [C:+2] Revert "rest-gateway: add support for sessionJwt cookies" [deployment-charts] - https://gerrit.wikimedia.org/r/1230971 (owner: Kamila Součková)
[17:20:55] (PS2) Kamila Součková: Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969
[17:21:04] (CR) CI reject: [V:-1] Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969 (owner: Kamila Součková)
[17:21:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[17:22:01] (Merged) jenkins-bot: Revert "rest-gateway: add support for sessionJwt cookies" [deployment-charts] - https://gerrit.wikimedia.org/r/1230971 (owner: Kamila Součková)
[17:22:34] (PS3) Kamila Součková: Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969
[17:25:29] (CR) Kamila Součková: [C:+2] Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969 (owner: Kamila Součková)
[17:27:49] (Merged) jenkins-bot: Revert "api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - https://gerrit.wikimedia.org/r/1230969 (owner: Kamila Součková)
[17:28:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[17:28:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[17:28:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[17:28:46] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1002 (**PASS**) -...
[17:28:54] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[17:29:24] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[17:29:25] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:48] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11549642 (Jclark-ctr)
[17:34:13] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:40:47] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11549678 (Quiddity)
[17:42:01] (PS1) Arlolra: Deploy PRV to 20 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1230974 (https://phabricator.wikimedia.org/T415386)
[17:49:55] (PS1) Clément Goubert: failoid-ng: Prepare 10 releases [deployment-charts] - https://gerrit.wikimedia.org/r/1230976
[17:51:04] (CR) Clément Goubert: [C:+2] failoid-ng: Break completely [deployment-charts] - https://gerrit.wikimedia.org/r/1230938 (owner: Clément Goubert)
[17:51:19] (PS2) Arlolra: Deploy PRV to 21 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1230974 (https://phabricator.wikimedia.org/T415386)
[17:52:50] (Merged) jenkins-bot: failoid-ng: Break completely [deployment-charts] - https://gerrit.wikimedia.org/r/1230938 (owner: Clément Goubert)
[17:54:02] (CR) Clément Goubert: [C:+2] failoid-ng: Prepare 10 releases [deployment-charts] - https://gerrit.wikimedia.org/r/1230976 (owner: Clément Goubert)
[17:55:51] (Merged) jenkins-bot: failoid-ng: Prepare 10 releases [deployment-charts] - https://gerrit.wikimedia.org/r/1230976 (owner: Clément Goubert)
[17:59:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[18:01:59] PROBLEM - Host wikikube-worker1108 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 7991.51 ms
[18:02:33] RECOVERY - Host wikikube-worker1108 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[18:12:25] (CR) Thcipriani: [C:+1] WP25EasterEggs added to extension-list, config var, enabled on beta cluster. [mediawiki-config] - https://gerrit.wikimedia.org/r/1230931 (https://phabricator.wikimedia.org/T415372) (owner: Jdrewniak)