[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0000)
[00:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 11.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:06:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 897.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 862.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:18:52] (PS1) Pppery: Generate our own logo thumbnails rather than using MediaWiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048)
[00:19:48] (CR) CI reject: [V:-1] Generate our own logo thumbnails rather than using MediaWiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048) (owner: Pppery)
[00:19:58] (PS2) Pppery: Generate our own logo thumbnails rather than using MediaWiki's [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048)
[00:20:47] (PS3) Pppery: Generate our own logo thumbnails rather than using MediaWiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048)
[00:20:49] (CR) CI reject: [V:-1] Generate our own logo thumbnails rather than using MediaWiki's [mediawiki-config] - https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048) (owner: Pppery)
[00:25:06] (CR) Pppery: "Adding some people who looked at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1217282 as reviewers for this follow-up." [mediawiki-config] - https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048) (owner: Pppery)
[00:28:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233674 (https://phabricator.wikimedia.org/T413951) (owner: STran)
[00:29:17] (CR) STran: [C:+1] IPInfo: Grant ipinfo-view-arbitrary-ip to checkuser group [mediawiki-config] - https://gerrit.wikimedia.org/r/1242424 (https://phabricator.wikimedia.org/T374718) (owner: Kosta Harlan)
[00:36:23] FIRING: GnmiTargetDown: asw1-22-ulsfo is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[00:38:33] (PS1) Catrope: Remove workaround for T370517, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/1242543 (https://phabricator.wikimedia.org/T370517)
[00:39:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1242544
[00:39:01] (CR) TrainBranchBot: 
[C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1242544 (owner: TrainBranchBot)
[00:40:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1242543 (https://phabricator.wikimedia.org/T370517) (owner: Catrope)
[00:41:22] RESOLVED: GnmiTargetDown: asw1-22-ulsfo is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[00:47:03] (CR) Dzahn: [C:+2] "Jelto, when I tried to deploy this to staging I got this effect where the command line just sits there for a long time until it eventually" [deployment-charts] - https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: Dzahn)
[00:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:51:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1242544 (owner: TrainBranchBot)
[00:56:38] (PS2) Aaron Schulz: Copy rest_v1-wikimedia.json to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/1224228 (https://phabricator.wikimedia.org/T396807)
[01:00:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1224228 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[01:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:08:52] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1242559
[01:08:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1242559 (owner: TrainBranchBot)
[01:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.11% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:33:11] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1242559 (owner: TrainBranchBot)
[01:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:40] (PS1) Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1242568
[01:41:33] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-23-ulsfo
[01:41:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo
[01:42:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1242568 (owner: Matthias Mullie)
[01:50:12] (PS3) Aaron Schulz: Copy rest_v1-wikimedia.json to standard-docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/1224228 (https://phabricator.wikimedia.org/T418188)
[01:50:42] (CR) ArielGlenn: [C:+1] "Dunno why gerrit removed my vote. Still valid." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1239669 (owner: Daniel Kinzler)
[01:50:52] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-23-ulsfo
[01:50:55] (PS2) Aaron Schulz: Switch math sandbox specs to plain wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1224253 (https://phabricator.wikimedia.org/T418188)
[01:51:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo
[01:52:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1224253 (https://phabricator.wikimedia.org/T418188) (owner: Aaron Schulz)
[02:01:14] (PS1) Aaron Schulz: [DNM] Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188)
[02:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:50] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.17 [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1242581 (https://phabricator.wikimedia.org/T413808)
[02:08:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.17 [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1242581 (https://phabricator.wikimedia.org/T413808) (owner: TrainBranchBot)
[02:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:22:12] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.17 [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1242581 (https://phabricator.wikimedia.org/T413808) (owner: TrainBranchBot)
[02:22:58] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11643245 (Papaul) Both switches are now running version 25.10.2. Still can not get the Cookbook sre.network.tls to pass on asw1-23-ulsfo. 
[02:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0300)
[03:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0400)
[04:15:15] (PS1) Aaron Schulz: [DNM] Add growthexperiments.v0 to $wgRestSandboxSpecs [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470)
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0500)
[05:01:12] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.14 (duration: 01m 10s)
[05:08:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:29] (PS1) Marostegui: Revert "pc1011,pc2011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1242748
[05:55:24] (CR) Marostegui: [C:+2] Revert "pc1011,pc2011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1242748 (owner: Marostegui)
[05:56:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc1011: Repooling pc1 after migration to Debian trixie
[05:56:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:56:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:56:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1011: Repooling pc1 after migration to Debian trixie
[06:04:48] (CR) Marostegui: "@fceratto@wikimedia.org could you please review this? I'd like to push it before the week ends." 
[puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:06:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Schema change
[06:08:02] !log Deploy schema change on dbstore1007:3314 T415786
[06:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:06] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[06:17:27] (PS1) Marostegui: site.pp: Reorganize pc8 host. [puppet] - https://gerrit.wikimedia.org/r/1242764
[06:18:15] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1242764 (owner: Marostegui)
[06:18:17] (CR) Marostegui: [C:+2] site.pp: Reorganize pc8 host. [puppet] - https://gerrit.wikimedia.org/r/1242764 (owner: Marostegui)
[06:38:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:36] jouncebot: nowandnext
[06:59:37] No deployments scheduled for the next 0 hour(s) and 0 minute(s)
[06:59:37] In 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0700)
[06:59:37] In 0 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0700)
[07:00:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1242424 (https://phabricator.wikimedia.org/T374718) (owner: Kosta Harlan)
[07:00:05] Deploy 
window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0700)
[07:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0700).
[07:02:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T415786)', diff saved to https://phabricator.wikimedia.org/P88998 and previous config saved to /var/cache/conftool/dbconfig/20260224-070241-marostegui.json
[07:02:46] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[07:17:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P88999 and previous config saved to /var/cache/conftool/dbconfig/20260224-071750-marostegui.json
[07:20:37] (CR) Muehlenhoff: [C:+2] admin: hashar: disable fetch.prunetags [puppet] - https://gerrit.wikimedia.org/r/1242261 (https://phabricator.wikimedia.org/T418085) (owner: Hashar)
[07:28:00] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: Eevans)
[07:29:57] (CR) Muehlenhoff: [C:+2] Remove obsolete acct toil class [puppet] - https://gerrit.wikimedia.org/r/1242292 (owner: Muehlenhoff)
[07:32:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P89000 and previous config saved to /var/cache/conftool/dbconfig/20260224-073258-marostegui.json
[07:37:37] SRE, ServiceOps new, ServiceOps-Mediawiki: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200 (JMeybohm) NEW
[07:38:23] (CR) Muehlenhoff: [C:+2] Remove obsolete config override for git protocol v2 
[puppet] - https://gerrit.wikimedia.org/r/1242299 (owner: Muehlenhoff)
[07:44:02] (CR) Muehlenhoff: [C:+2] Remove support for prometheus node exporter 0.17 [puppet] - https://gerrit.wikimedia.org/r/1242297 (owner: Muehlenhoff)
[07:47:48] (CR) Gehel: [C:+1] Move an HDFS journalnode to a newer host [puppet] - https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:48:03] (CR) Gehel: [C:+1] Move a second journalnode to a newer host [puppet] - https://gerrit.wikimedia.org/r/1242511 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:48:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T415786)', diff saved to https://phabricator.wikimedia.org/P89001 and previous config saved to /var/cache/conftool/dbconfig/20260224-074806-marostegui.json
[07:48:11] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[07:48:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2248.codfw.wmnet with reason: Maintenance
[07:48:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T415786)', diff saved to https://phabricator.wikimedia.org/P89002 and previous config saved to /var/cache/conftool/dbconfig/20260224-074831-marostegui.json
[07:48:54] (CR) Muehlenhoff: [C:+2] udev: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1242294 (owner: Muehlenhoff)
[07:49:28] (CR) Gehel: "Can we remove the nodes from the network topology AND put them in setup at the same time? 
Or do we need to apply the network topology firs" [puppet] - https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:49:54] (CR) Brouberol: [C:+1] Move an HDFS journalnode to a newer host [puppet] - https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:50:11] (CR) Gehel: [C:+1] Add the configuration for the new dse-k8s worker nodes that were an-worker [puppet] - https://gerrit.wikimedia.org/r/1242514 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:50:13] (CR) Brouberol: [C:+1] Move a second journalnode to a newer host [puppet] - https://gerrit.wikimedia.org/r/1242511 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:50:50] (CR) Brouberol: [C:+1] Prepare to decom the old an-worker hosts [puppet] - https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:51:44] (CR) Brouberol: "Once these hosts get configured as dse-k8s-workers, containers will start getting scheduled on them. 
Can we make sure that non of them hav" [puppet] - https://gerrit.wikimedia.org/r/1242514 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[07:52:52] (CR) Brouberol: Add the new druid-internal servers to site.pp and preseed.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: Btullis)
[07:53:10] (CR) Gehel: [C:-1] Add the new druid-internal servers to site.pp and preseed.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: Btullis)
[07:53:11] (CR) Brouberol: [C:+1] Add dbstore1010 to site.pp and preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1242533 (https://phabricator.wikimedia.org/T417948) (owner: Btullis)
[07:56:26] (PS2) Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1242568
[07:56:48] (CR) Muehlenhoff: [C:+2] base::kernel: Unconditionally use the autoremove logic [puppet] - https://gerrit.wikimedia.org/r/1239696 (owner: Muehlenhoff)
[08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T0800).
[08:00:05] Pppery, matthiasmullie, and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:16] o/
[08:01:59] (CR) Muehlenhoff: [C:+2] Unconditionally install puppet-module-puppetlabs-augeas-core [puppet] - https://gerrit.wikimedia.org/r/1239889 (owner: Muehlenhoff)
[08:04:27] hi
[08:05:15] I will sync out my config patch towards the end of this window
[08:05:21] stepping away for a while now, though
[08:05:55] I'll get started with my patch now
[08:06:19] (CR) Muehlenhoff: [C:+2] nftables: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1219877 (owner: Muehlenhoff)
[08:06:20] (CR) TrainBranchBot: [C:+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1242568 (owner: Matthias Mullie)
[08:06:53] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1242407 (owner: Muehlenhoff)
[08:07:50] (Merged) jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1242568 (owner: Matthias Mullie)
[08:08:16] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1242568|Squashed diff to master]]
[08:10:13] (CR) Federico Ceratto: [C:+1] "Puppet is now enabled (doublechecked using sr.puppet(sr.remote().query('db2230*')).check_enabled() ) so the CI can be run again." [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[08:10:16] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1242568|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:11:44] !log mlitn@deploy2002 mlitn: Continuing with sync
[08:15:39] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242568|Squashed diff to master]] (duration: 07m 23s)
[08:17:09] I'm done - outstanding patches: pppery (not here?) 
& kostajh (will get to his near end of window)
[08:17:53] Oh, I just realized I have another one :D
[08:18:15] (PS1) Matthias Mullie: Minerva TOC: reserve space for the article page heading button [extensions/MobileFrontend] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1243004 (https://phabricator.wikimedia.org/T417932)
[08:20:18] (CR) TrainBranchBot: [C:+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1243004 (https://phabricator.wikimedia.org/T417932) (owner: Matthias Mullie)
[08:29:51] (PS4) Federico Ceratto: service, trafficserver: Prepare "linked-artifacts" k8s pod [puppet] - https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112)
[08:33:30] (Merged) jenkins-bot: Minerva TOC: reserve space for the article page heading button [extensions/MobileFrontend] (wmf/1.46.0-wmf.16) - https://gerrit.wikimedia.org/r/1243004 (https://phabricator.wikimedia.org/T417932) (owner: Matthias Mullie)
[08:33:49] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1243004|Minerva TOC: reserve space for the article page heading button (T417932)]]
[08:33:53] T417932: [Minerva TOC] design feedback - https://phabricator.wikimedia.org/T417932
[08:34:36] (PS1) Slyngshede: Allow blacklisting of domains for signup [software/bitu] - https://gerrit.wikimedia.org/r/1243007 (https://phabricator.wikimedia.org/T418201)
[08:35:40] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1243004|Minerva TOC: reserve space for the article page heading button (T417932)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:36:27] !log mlitn@deploy2002 mlitn: Continuing with sync
[08:40:07] (CR) Muehlenhoff: [C:+2] Remove HPSA RAID support [puppet] - https://gerrit.wikimedia.org/r/1237499 (owner: Muehlenhoff)
[08:40:22] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243004|Minerva TOC: reserve space for the article page heading button (T417932)]] (duration: 06m 33s)
[08:40:26] T417932: [Minerva TOC] design feedback - https://phabricator.wikimedia.org/T417932
[08:42:05] I'm done - outstanding patches: pppery (not here?) & kostajh (will get to his near end of window)
[08:45:30] SRE, Infrastructure-Foundations: decom cookbook used Junos commands on a Nokia switch - https://phabricator.wikimedia.org/T417428#11643702 (ayounsi) Open→Resolved This is done.
[08:56:34] (CR) Arnaudb: [C:+2] gerrit: prevent NodeTextfileStale alert on nft throttling [alerts] - https://gerrit.wikimedia.org/r/1242413 (https://phabricator.wikimedia.org/T418139) (owner: Arnaudb)
[08:57:37] ok, syncing my patch now
[08:57:42] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1242424 (https://phabricator.wikimedia.org/T374718) (owner: Kosta Harlan)
[08:58:09] (Merged) jenkins-bot: gerrit: prevent NodeTextfileStale alert on nft throttling [alerts] - https://gerrit.wikimedia.org/r/1242413 (https://phabricator.wikimedia.org/T418139) (owner: Arnaudb)
[08:58:40] (Merged) jenkins-bot: IPInfo: Grant ipinfo-view-arbitrary-ip to checkuser group [mediawiki-config] - https://gerrit.wikimedia.org/r/1242424 (https://phabricator.wikimedia.org/T374718) (owner: Kosta Harlan)
[08:58:58] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1242424|IPInfo: Grant ipinfo-view-arbitrary-ip to checkuser group (T374718)]]
[08:59:03] T374718: Allow Special:IPInfo to return IP information of arbitrary addresses for users with the correct permissions - 
https://phabricator.wikimedia.org/T374718
[09:00:52] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[09:01:10] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[09:01:12] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1242424|IPInfo: Grant ipinfo-view-arbitrary-ip to checkuser group (T374718)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:04:18] !log kharlan@deploy2002 kharlan: Continuing with sync
[09:07:34] (PS4) Marostegui: mariadb: Alert on pt-heartbeat not running [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079)
[09:08:27] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242424|IPInfo: Grant ipinfo-view-arbitrary-ip to checkuser group (T374718)]] (duration: 09m 29s)
[09:08:32] T374718: Allow Special:IPInfo to return IP information of arbitrary addresses for users with the correct permissions - https://phabricator.wikimedia.org/T374718
[09:09:36] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[09:09:42] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:11:58] (CR) Marostegui: "PCC reran: https://puppet-compiler.wmflabs.org/output/1240680/5893/" [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[09:12:33] (PS1) Brouberol: growhbook: disable frontend telemetry 
[deployment-charts] - https://gerrit.wikimedia.org/r/1243023 (https://phabricator.wikimedia.org/T418211)
[09:13:48] (CR) Jaime Nuche: [C:+1] "Thank you Daniel!" [puppet] - https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[09:17:27] (CR) Muehlenhoff: [C:+2] analytics::cluster::packages::common: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1237493 (owner: Muehlenhoff)
[09:25:42] (PS1) Muehlenhoff: apt: Remove support for Buster [puppet] - https://gerrit.wikimedia.org/r/1243035
[09:28:21] (CR) Muehlenhoff: [C:+1] "Looks good!" [software/bitu] - https://gerrit.wikimedia.org/r/1243007 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[09:28:34] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243035 (owner: Muehlenhoff)
[09:28:50] (PS5) Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[09:30:28] SRE, Bitu, Infrastructure-Foundations, Patch-For-Review: wikimedia-l was signed up for a developer account - https://phabricator.wikimedia.org/T418201#11644006 (Peachey88)
[09:30:50] (CR) Brouberol: [C:-1] "Now that the signature of the `get_next_clusters_nodes` method has been changed, the change needs to be reflected here as well" [cookbooks] - https://gerrit.wikimedia.org/r/1235113 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[09:31:46] (PS6) Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[09:34:19] (CR) Arnaudb: gerrit: swap gerrit-spare and gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) (owner: Arnaudb)
[09:34:33] (CR) Slyngshede: [C:+2]
Allow blacklisting of domains for signup [software/bitu] - https://gerrit.wikimedia.org/r/1243007 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[09:34:38] (PS7) Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[09:35:24] (PS1) Brouberol: deployment_server: provision the dse-k8s opensearch-operator-3 kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1243041 (https://phabricator.wikimedia.org/T418176)
[09:36:08] (PS1) Filippo Giunchedi: hieradata: route toolhub probe alerts to wmcs [puppet] - https://gerrit.wikimedia.org/r/1243042 (https://phabricator.wikimedia.org/T316682)
[09:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:16] (PS1) Muehlenhoff: ferm: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/1243045
[09:37:26] (Merged) jenkins-bot: Allow blacklisting of domains for signup [software/bitu] - https://gerrit.wikimedia.org/r/1243007 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[09:38:00] (PS1) Brouberol: dse-k8s: define the opensearch-operator-3 namespace to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1243046 (https://phabricator.wikimedia.org/T418176)
[09:38:54] (CR) Dpogorzelski: [C:+2] kserve: fix dependency on cert-manager [deployment-charts] - https://gerrit.wikimedia.org/r/1242439 (owner: Dpogorzelski)
[09:40:07] (PS8) Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[09:40:36] (CR) Muehlenhoff: "check experimental" [puppet] -
https://gerrit.wikimedia.org/r/1243045 (owner: Muehlenhoff)
[09:41:58] (PS1) Awight: Subreferencing pilot wikis, phase 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1243047 (https://phabricator.wikimedia.org/T418209)
[09:42:10] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host dborch1003.eqiad.wmnet
[09:42:12] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[09:42:21] (PS9) Ayounsi: WIP: create cookbook to depool all services in a given rack [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[09:42:49] (PS1) Muehlenhoff: mtail: Use the Debian version of mtail universally [puppet] - https://gerrit.wikimedia.org/r/1243048
[09:43:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1243047 (https://phabricator.wikimedia.org/T418209) (owner: Awight)
[09:43:43] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie
[09:43:44] (CR) Majavah: ferm: Remove obsolete OS check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1243045 (owner: Muehlenhoff)
[09:44:52] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:45:21] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[09:45:33] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[09:45:37] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host dborch1003.eqiad.wmnet
[09:45:42] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host dborch1003.eqiad.wmnet
[09:45:45] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[09:47:00] (CR) Marostegui: "As you +1ed and the PCC looks good on db2230 I am merging!"
[puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[09:47:10] (CR) Marostegui: [C:+2] mariadb: Alert on pt-heartbeat not running [puppet] - https://gerrit.wikimedia.org/r/1240680 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[09:48:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243048 (owner: Muehlenhoff)
[09:50:23] (CR) Majavah: hieradata: route toolhub probe alerts to wmcs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1243042 (https://phabricator.wikimedia.org/T316682) (owner: Filippo Giunchedi)
[09:51:14] (CR) Arnaudb: [C:+2] gerrit: alert for broken replication [alerts] - https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb)
[09:51:39] fceratto@cumin1003 makevm (PID 535318) is awaiting input
[09:51:41] (PS1) Dpogorzelski: ml-serve-eqiad: k8s upgrade [puppet] - https://gerrit.wikimedia.org/r/1243053
[09:53:50] (PS1) Dpogorzelski: ml-serve-eqiad: k8s upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1243054
[09:53:59] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dborch1003.eqiad.wmnet - fceratto@cumin1003"
[09:54:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dborch1003.eqiad.wmnet - fceratto@cumin1003"
[09:54:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:54:04] !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache dborch1003.eqiad.wmnet on all recursors
[09:54:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dborch1003.eqiad.wmnet on all recursors
[09:54:34] !log dpogorzelski@cumin1003 START - Cookbook
sre.k8s.pool-depool-cluster depool all services in eqiad/ml-serve-eqiad: maintenance
[09:54:35] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dborch1003.eqiad.wmnet - fceratto@cumin1003"
[09:54:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM dborch1003.eqiad.wmnet - fceratto@cumin1003"
[09:55:21] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in eqiad/ml-serve-eqiad: maintenance
[09:55:40] (Merged) jenkins-bot: gerrit: alert for broken replication [alerts] - https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb)
[09:55:42] !log dpogorzelski@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=recommendation-api,name=codfw
[09:55:52] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:55:56] (PS2) Elukey: profile::httpbb::docker-registry: improve tests [puppet] - https://gerrit.wikimedia.org/r/1242452 (https://phabricator.wikimedia.org/T414576)
[09:56:26] (CR) Elukey: "Reworked all the tests and tested them on cumin1003 :)" [puppet] - https://gerrit.wikimedia.org/r/1242452 (https://phabricator.wikimedia.org/T414576) (owner: Elukey)
[09:56:38] (CR) Btullis: [C:+2] Move an HDFS journalnode to a newer host [puppet] - https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: Btullis)
[09:56:52] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:57:14] (PS10) Ayounsi:
WIP: create cookbook to depool all services in a given rack [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300)
[09:57:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:57:41] fceratto@cumin1003 makevm (PID 535318) is awaiting input
[09:59:52] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:00:23] (CR) Filippo Giunchedi: hieradata: route toolhub probe alerts to wmcs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1243042 (https://phabricator.wikimedia.org/T316682) (owner: Filippo Giunchedi)
[10:00:30] (CR) JMeybohm: [C:+1] profile::httpbb::docker-registry: improve tests [puppet] - https://gerrit.wikimedia.org/r/1242452 (https://phabricator.wikimedia.org/T414576) (owner: Elukey)
[10:00:36] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster ml-serve-eqiad: Kubernetes upgrade
[10:01:07] (CR) Elukey: [C:+2] profile::httpbb::docker-registry: improve tests [puppet] - https://gerrit.wikimedia.org/r/1242452 (https://phabricator.wikimedia.org/T414576) (owner: Elukey)
[10:01:15] (CR) Thiemo Kreuz (WMDE): [C:+1] Subreferencing pilot wikis, phase 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1243047 (https://phabricator.wikimedia.org/T418209) (owner: Awight)
[10:02:09] (CR) Hashar: gerrit: alert for broken replication (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb)
[10:02:58] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host dborch1003.eqiad.wmnet with OS trixie
[10:02:58] (CR) Dpogorzelski: [C:+2] ml-serve-eqiad: k8s upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1243054 (owner: Dpogorzelski)
[10:03:36] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1010.eqiad.wmnet, ml-serve1009.eqiad.wmnet, ml-serve1007.eqiad.wmnet, ml-serve1003.eqiad.wmnet, ml-serve1006.eqiad.wmnet, ml-serve1004.eqiad.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers ml-serve1007.eqiad.wmnet, ml-serve1005.eqiad.wmnet, ml-serve1003.eqiad.wmnet, ml-serve1011.eqiad.wmnet, ml-s
[10:03:36] .eqiad.wmnet, ml-serve1002.eqiad.wmnet are marked down but pooled: ml-ctrl_6443: Servers ml-serve-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:03:38] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl1002.eqiad.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers ml-serve1008.eqiad.wmnet, ml-serve1009.eqiad.wmnet, ml-serve1005.eqiad.wmnet, ml-serve1011.eqiad.wmnet, ml-serve1006.eqiad.wmnet, ml-serve1002.eqiad.wmnet are marked down but pooled: inference_30443: Servers ml-serve1008.eqiad.wmnet, ml
[10:03:38] 07.eqiad.wmnet, ml-serve1005.eqiad.wmnet, ml-serve1006.eqiad.wmnet, ml-serve1011.eqiad.wmnet, ml-serve1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:05:13] (CR) Dpogorzelski: [C:+2] ml-serve-eqiad: k8s upgrade [puppet] - https://gerrit.wikimedia.org/r/1243053 (owner: Dpogorzelski)
[10:05:34] dpogorzelski@cumin1003 wipe-cluster (PID 551887) is awaiting input
[10:06:19] (CR) Dpogorzelski: [V:+2 C:+2] ml-serve-eqiad: k8s upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1243054 (owner: Dpogorzelski)
[10:06:53] SRE,
ServiceOps new, Kubernetes, Patch-For-Review: Failing docker registry httpbb tests - https://phabricator.wikimedia.org/T414576#11644188 (elukey) Open→Resolved ` elukey@cumin1003:~$ sudo httpbb --hosts registry2004.codfw.wmnet /srv/deployment/httpbb-tests/docker-registry/test_dock...
[10:13:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:17:36] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:17:46] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:17:55] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:18:20] dpogorzelski@cumin1003 wipe-cluster (PID 551887) is awaiting input
[10:18:23] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:18:53] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:19:22] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:19:38] (CR) Arnaudb: [C:+2] gerrit: alert for broken replication (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb)
[10:21:39] (PS3) Arnaudb: gerrit: remove code for having multiple daemon users [puppet] - https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) (owner: Dzahn)
[10:21:45] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:21:52] (PS3) Dzahn: releases: upgrade Java version from 17 to 21 [puppet] - https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109)
[10:21:53] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:22:01] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:22:14] (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) (owner: Dzahn)
[10:22:46] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:22:56] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:23:42] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:23:53] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:23:58] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:24:10] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:24:13] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:27:24] (PS1) MVernon: apus: add two new frontends in codfw [puppet] - https://gerrit.wikimedia.org/r/1243066 (https://phabricator.wikimedia.org/T416387)
[10:27:27] (PS1) MVernon: apus: remove two codfw frontends for decom [puppet] - https://gerrit.wikimedia.org/r/1243067 (https://phabricator.wikimedia.org/T416387)
[10:28:46] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:28:47] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:30:00] (CR) MVernon: "One query, which may be me missing something :)" [puppet] - https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: Eevans)
[10:30:58] (CR) Marostegui: [C:+1] apus: add two new frontends in codfw [puppet] - https://gerrit.wikimedia.org/r/1243066 (https://phabricator.wikimedia.org/T416387) (owner: MVernon)
[10:32:41] (CR) MVernon: [C:+2] apus: add two new frontends in codfw [puppet] - https://gerrit.wikimedia.org/r/1243066 (https://phabricator.wikimedia.org/T416387) (owner: MVernon)
[10:33:41] (PS1) Muehlenhoff: Remove OS check for nrpe2nodexp [puppet] - https://gerrit.wikimedia.org/r/1243068
[10:34:01] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:34:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:34:40] !log slyngshede@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2045.codfw.wmnet with OS trixie
[10:34:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:34:59] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:35:10] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie
[10:36:14] !log slyngshede@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2045.codfw.wmnet with OS trixie
[10:38:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:41] (PS1) Muehlenhoff: syslog::remote: Remove buster workarounds [puppet] - https://gerrit.wikimedia.org/r/1243069
[10:41:42] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:42:48] SRE, SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221 (EMcFarland-WMF) NEW
[10:44:15] (CR) Clément Goubert: [C:+1] envoy: Allow inboundonly drain and support min wait time [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[10:44:33] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:45:11] (CR) Clément Goubert: [C:+1] mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[10:46:16] (PS1) Marostegui: db2230: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1243070 (https://phabricator.wikimedia.org/T285079)
[10:46:27] (CR) Clément Goubert: [C:+1] mesh: Support injection of extra env vars into envoy container [deployment-charts] - https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[10:46:44] (CR) Clément Goubert: [C:+1] mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[10:48:03] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243068 (owner: Muehlenhoff)
[10:48:06] (Abandoned) Clément Goubert: mw-debug: Immediately drain envoy on termination [deployment-charts] -
https://gerrit.wikimedia.org/r/1242354 (https://phabricator.wikimedia.org/T364245) (owner: Clément Goubert)
[10:48:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:49:24] dpogorzelski@cumin1003 wipe-cluster (PID 551887) is awaiting input
[10:49:54] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243069 (owner: Muehlenhoff)
[10:51:04] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dborch1003.eqiad.wmnet with OS trixie
[10:51:04] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host dborch1003.eqiad.wmnet
[10:51:06] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[10:51:18] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host dborch1003.eqiad.wmnet
[10:51:19] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[10:51:34] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[10:51:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:52:02] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[10:52:17] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[10:53:38] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
[10:53:45] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:54:03] (CR) Clément Goubert: [C:+1] mesh: Support injection of extra env vars into envoy container (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[10:54:04] (CR) Muehlenhoff: [C:+1] "Note that Bookworm doesn't provide Java 21 natively, this is a locally maintained component. We're updating it mostly to allow the Gerrit " [puppet] - https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[10:54:09] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
[10:54:14] (CR) Clément Goubert: [C:+1] mw-debug: Pilot new drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[10:54:23] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[10:54:42] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:54:55] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[10:55:05] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[10:55:33] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:55:51] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:56:11] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:56:23] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:56:35] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:56:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:56:49] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:56:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T415786)', diff saved to https://phabricator.wikimedia.org/P89003 and previous config saved to /var/cache/conftool/dbconfig/20260224-105651-marostegui.json
[10:56:56] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[10:57:03] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:57:05] fceratto@cumin1003 makevm (PID 606166) is awaiting input
[10:57:20] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:58:57] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster ml-serve-eqiad: Kubernetes upgrade
[10:59:15] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1100)
[11:00:12] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance
[11:00:17] !log dpogorzelski@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=recommendation-api,name=eqiad
[11:00:23] !log dpogorzelski@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=recommendation-api,name=codfw
[11:01:39] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:01:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:02:41] (CR) Muehlenhoff: Inform about gitlab profile updating quirks (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) (owner: Slyngshede)
[11:04:30] SRE, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores: Upgrade Kafka to version 3.5 - https://phabricator.wikimedia.org/T416669#11644404 (JMeybohm)
[11:07:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:07:57] RECOVERY - PyBal backends
health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:08:29] (PS1) JMeybohm: sre.k8s.pool-depool-node: Fix type annotation [cookbooks] - https://gerrit.wikimedia.org/r/1243075 (https://phabricator.wikimedia.org/T410537)
[11:12:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P89005 and previous config saved to /var/cache/conftool/dbconfig/20260224-111159-marostegui.json
[11:12:24] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe2004.codfw.wmnet
[11:12:39] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:12:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:13:20] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie
[11:13:43] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe2005.codfw.wmnet
[11:13:55] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe2004.codfw.wmnet
[11:14:01] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe2005.codfw.wmnet
[11:14:03] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance
[11:14:03] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance
[11:14:43] (PS2) Fabfur: hiera: test haproxy 3.0 on cp7001
[puppet] - https://gerrit.wikimedia.org/r/1242427
[11:15:18] (CR) Muehlenhoff: [C:+2] Run IDM spec tests on Bookworm/Trixie [puppet] - https://gerrit.wikimedia.org/r/1240840 (owner: Muehlenhoff)
[11:17:45] SRE, SRE-Access-Requests, Gerrit-Privilege-Requests, Release-Engineering-Team, Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11644447 (IBerker-WMF) As Riku's manager, I approve.
[11:20:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:21:31] !log depool moss-fe200{1,2} prep for decommissioning T416387
[11:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:36] T416387: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387
[11:24:15] (CR) Jcrespo: [C:+1] apus: remove two codfw frontends for decom [puppet] - https://gerrit.wikimedia.org/r/1243067 (https://phabricator.wikimedia.org/T416387) (owner: MVernon)
[11:24:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:24:56] (CR) MVernon: [C:+2] apus: remove two codfw frontends for decom [puppet] - https://gerrit.wikimedia.org/r/1243067 (https://phabricator.wikimedia.org/T416387) (owner: MVernon)
[11:26:28] fceratto@cumin1003 makevm (PID 606166) is awaiting input
[11:26:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:27:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P89006 and previous config saved to
/var/cache/conftool/dbconfig/20260224-112708-marostegui.json [11:27:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:28:07] (PS1) Ayounsi: [WIP] Add depool strategy for rack depool cookbook [puppet] - https://gerrit.wikimedia.org/r/1243077 (https://phabricator.wikimedia.org/T327300) [11:28:29] (CR) Muehlenhoff: [C:+2] puppetdb: Drop firewall rule for access to Puppet 5 servers [puppet] - https://gerrit.wikimedia.org/r/1239647 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [11:29:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts moss-fe[2001-2002].codfw.wmnet [11:32:39] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:33:57] (PS2) Slyngshede: Inform about gitlab profile updating quirks [software/bitu] - https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) [11:34:06] (CR) Slyngshede: Inform about gitlab profile updating quirks (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) (owner: Slyngshede) [11:36:04] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:36:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:38:32] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:38:36] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host dborch1003.eqiad.wmnet [11:39:44] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy
https://wikitech.wikimedia.org/wiki/PyBal [11:39:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:40:28] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1204.eqiad.wmnet [11:42:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T415786)', diff saved to https://phabricator.wikimedia.org/P89007 and previous config saved to /var/cache/conftool/dbconfig/20260224-114217-marostegui.json [11:42:21] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:42:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1260.eqiad.wmnet with reason: Maintenance [11:42:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T415786)', diff saved to https://phabricator.wikimedia.org/P89008 and previous config saved to /var/cache/conftool/dbconfig/20260224-114242-marostegui.json [11:45:09] mvernon@cumin2002 decommission (PID 2599856) is awaiting input [11:47:17] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1204.eqiad.wmnet [11:49:09] mvernon@cumin2002 decommission (PID 2599856) is awaiting input [11:52:02] !log slyngshede@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2045.codfw.wmnet with OS trixie [11:52:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:46] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie [11:55:32] (PS5) Tiziano Fogli: ldap_users_sync.py: add non-blocking errors handling [puppet] - https://gerrit.wikimedia.org/r/1243063 (https://phabricator.wikimedia.org/T418118) [11:55:48] (PS1) Tiziano Fogli: ldap_users_sync.py: format code [puppet] - https://gerrit.wikimedia.org/r/1243062 (https://phabricator.wikimedia.org/T418118) [12:01:44] (CR) Muehlenhoff: [C:+1] "Look good!" [software/bitu] - https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) (owner: Slyngshede) [12:02:49] mvernon@cumin2002 decommission (PID 2599856) is awaiting input [12:03:05] (PS1) Muehlenhoff: Remove various Hiera files only necessary for Puppet 5 [puppet] - https://gerrit.wikimedia.org/r/1243087 (https://phabricator.wikimedia.org/T365798) [12:03:16] (CR) Clément Goubert: [C:+1] sre.k8s.pool-depool-node: Fix type annotation [cookbooks] - https://gerrit.wikimedia.org/r/1243075 (https://phabricator.wikimedia.org/T410537) (owner: JMeybohm) [12:05:26] !log slyngshede@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2045.codfw.wmnet with OS trixie [12:05:45] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie [12:05:51] (CR) Clément Goubert: [C:+1] Switch math sandbox specs to plain wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1224253 (https://phabricator.wikimedia.org/T418188) (owner: Aaron Schulz) [12:06:25] (PS1) Muehlenhoff: Remove create_ecdsa_cert [puppet] - https://gerrit.wikimedia.org/r/1243090 (https://phabricator.wikimedia.org/T365798) [12:07:05] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243087 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [12:10:22] (CR) Filippo
Giunchedi: [C:+1] "Neat" [puppet] - https://gerrit.wikimedia.org/r/1243069 (owner: Muehlenhoff) [12:14:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243090 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [12:17:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:11] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/ml-staging-codfw: maintenance [12:23:11] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/ml-staging-codfw: maintenance [12:24:48] (CR) Muehlenhoff: [C:+2] puppetserver: Update two hooks to the variants from the puppetserver module [puppet] - https://gerrit.wikimedia.org/r/1240924 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [12:28:25] (PS1) Muehlenhoff: Revert "puppetserver: Update two hooks to the variants from the puppetserver module" [puppet] - https://gerrit.wikimedia.org/r/1243100 [12:29:17] (CR) Muehlenhoff: [C:+2] Revert "puppetserver: Update two hooks to the variants from the puppetserver module" [puppet] - https://gerrit.wikimedia.org/r/1243100 (owner: Muehlenhoff) [12:29:31] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:29:45] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [12:30:23] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[12:30:40] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [12:30:55] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:32:18] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:32:37] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:32:58] (PS1) JMeybohm: sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) [12:33:39] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:33:48] (PS4) Daniel Kinzler: rest gateway: expose headers [deployment-charts] - https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [12:34:41] (PS5) Daniel Kinzler: rest gateway: expose headers [deployment-charts] - https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [12:35:02] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:35:05] (CR) Daniel Kinzler: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: Daniel Kinzler) [12:35:11] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:35:24] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[12:36:04] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:36:07] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:36:31] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [12:37:04] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:37:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:37:38] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:37:59] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:38:19] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:38:29] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:38:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:40:14] (PS4) Santiago Faci: test-kitchen kubernetes chart: New config property [deployment-charts] - https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) [12:40:18] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [12:41:58] (CR) Btullis: "Thanks. Good question. I've opted to do both at the same time here.
The topology change affects the an-master hosts mainly and requires a " [puppet] - https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) (owner: Btullis) [12:42:12] (CR) Matthieulec: [C:+1] "Thanks for catching that!" [cookbooks] - https://gerrit.wikimedia.org/r/1243075 (https://phabricator.wikimedia.org/T410537) (owner: JMeybohm) [12:43:38] (CR) Btullis: [C:+2] Add dbstore1010 to site.pp and preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1242533 (https://phabricator.wikimedia.org/T417948) (owner: Btullis) [12:46:20] (CR) Btullis: Add the new druid-internal servers to site.pp and preseed.yaml (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: Btullis) [12:46:50] (CR) Btullis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: Btullis) [12:47:42] mvernon@cumin2002 decommission (PID 2599856) is awaiting input [12:48:17] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: Btullis) [12:51:04] (CR) David Caro: "This is breaking puppetdb servers in cloud (tools/toolsbeta), I'll revert and then we can look at it more camly" [puppet] - https://gerrit.wikimedia.org/r/1239647 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [12:51:30] (PS1) David Caro: Revert "puppetdb: Drop firewall rule for access to Puppet 5 servers" [puppet] - https://gerrit.wikimedia.org/r/1243106 [12:51:56] (PS1) Muehlenhoff: Reapply "puppetserver: Update two hooks to the variants from the puppetserver module" [puppet] - https://gerrit.wikimedia.org/r/1243107 (https://phabricator.wikimedia.org/T365798) [12:52:34] (CR) CI reject: [V:-1] Reapply "puppetserver: Update two hooks to the variants from the puppetserver module" [puppet] -
https://gerrit.wikimedia.org/r/1243107 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [12:52:52] !log fceratto@dns1004 START - running authdns-update [12:54:00] (PS2) David Caro: Revert "puppetdb: Drop firewall rule for access to Puppet 5 servers" [puppet] - https://gerrit.wikimedia.org/r/1243106 (https://phabricator.wikimedia.org/T365798) [12:54:20] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243106 (https://phabricator.wikimedia.org/T365798) (owner: David Caro) [12:54:36] (CR) Clément Goubert: [C:+1] sre.k8s.pool-depool-node: Fix type annotation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1243075 (https://phabricator.wikimedia.org/T410537) (owner: JMeybohm) [12:55:47] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1243106 (https://phabricator.wikimedia.org/T365798) (owner: David Caro) [12:55:51] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2045.codfw.wmnet with OS trixie [12:56:25] (PS2) Arnaudb: gerrit: limit GerritHAProxyServiceUnavailable scope [alerts] - https://gerrit.wikimedia.org/r/1243102 (https://phabricator.wikimedia.org/T418084) [12:56:25] (CR) Arnaudb: "Side effect: the newly added rule was triggering `AlertLintProblem` because we don't expose that metric on all sites (https://w.wiki/HyeU)" [alerts] - https://gerrit.wikimedia.org/r/1243102 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb) [12:56:40] (CR) Kamila Součková: [C:+1] rest-gateway: disable external_services for minikube [deployment-charts] - https://gerrit.wikimedia.org/r/1242428 (https://phabricator.wikimedia.org/T414333) (owner: Daniel Kinzler) [12:58:56] (CR) Kamila Součková: [C:+1] rest-gateway: use MINUTE limits in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1239669 (owner: Daniel Kinzler) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1300) [13:00:15] (PS2) Btullis: Move a second journalnode to a newer host [puppet] - https://gerrit.wikimedia.org/r/1242511 (https://phabricator.wikimedia.org/T414948) [13:01:40] (PS1) Dpogorzelski: ml-serve: fix istio/transparentproxy config [deployment-charts] - https://gerrit.wikimedia.org/r/1243112 [13:01:51] !log fceratto@dns1004 START - running authdns-update [13:02:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:30] (PS1) Muehlenhoff: puppetdb: Allow access for cloud puppetservers [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) [13:02:43] (CR) Arnaudb: [C:+2] gerrit: swap gerrit-replica and gerrit-spare [dns] - https://gerrit.wikimedia.org/r/1242268 (https://phabricator.wikimedia.org/T417247) (owner: Arnaudb) [13:02:53] !log arnaudb@dns1004 START - running authdns-update [13:04:01] !log fceratto@dns1004 START - running authdns-update [13:04:46] (CR) CI reject: [V:-1] puppetdb: Allow access for cloud puppetservers [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [13:04:46] (CR) Arnaudb: [C:+2] gerrit: swap gerrit-spare and gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) (owner: Arnaudb) [13:05:14] (CR) Arnaudb: [C:+2] gerrit: disable service on gerrit2002 to reimage [puppet] - https://gerrit.wikimedia.org/r/1242272 (https://phabricator.wikimedia.org/T417247) (owner: Arnaudb) [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:28] (PS2) Muehlenhoff: puppetdb: Allow access for cloud puppetservers [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) [13:07:38] !log fceratto@dns1004 START - running authdns-update [13:07:45] (CR) Kamila Součková: [C:+1] "+1 with the "I'm fine with deploying this" hat, but I do not currently have the brain to check that the test functionality is equivalent. " [deployment-charts] - https://gerrit.wikimedia.org/r/1239972 (owner: Daniel Kinzler) [13:08:43] (CR) CI reject: [V:-1] puppetdb: Allow access for cloud puppetservers [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [13:09:42] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:10:50] (PS3) Muehlenhoff: puppetdb: Allow access for cloud puppetservers [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) [13:11:16] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [13:12:17] FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:13:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:00] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:14:20] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [13:14:26] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:14:32] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:15:15] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deploy manual changes from netbox - fceratto@cumin1003" [13:15:34] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:25] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [13:16:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:42] PROBLEM - check if authdns-update was run after a change was merged to
operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:16:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with 
operations/dns.git (SHA: local is 092081508dde94683e62f13137da8749ac4dfc7c, dns.git is 3e0cdc75cf6c0cffabb6e1f0fa146fd2ac0f7fa5) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:18:19] fceratto@cumin1003 netbox (PID 747116) is awaiting input [13:20:01] !log arnaudb@cumin1003 START - Cookbook sre.dns.wipe-cache gerrit-replica.discovery.wmnet gerrit-spare.discovery.wmnet on all recursors [13:20:05] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gerrit-replica.discovery.wmnet gerrit-spare.discovery.wmnet on all recursors [13:20:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:21:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:22:17] FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:46] FIRING: GerritReplicationUnavailable: Gerrit replication on gerrit.wikimedia.org:443 is lagging for more than 15 minutes. 
- https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritReplicationUnavailable - https://grafana.wikimedia.org/goto/8VXsGHdDR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGerritReplicationUnavailable [13:24:28] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Deploy manual changes from netbox - fceratto@cumin1003" [13:24:28] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:32] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [13:25:03] !log fceratto@dns1004 START - running authdns-update [13:25:56] (PS1) Arnaudb: gerrit: fix discovery record [dns] - https://gerrit.wikimedia.org/r/1243119 (https://phabricator.wikimedia.org/T417247) [13:26:08] !log arnaudb@dns1004 START - running authdns-update [13:26:22] !log fceratto@dns1004 END - running authdns-update [13:26:31] (CR) Santiago Faci: test-kitchen kubernetes chart: New config property (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) (owner: Santiago Faci) [13:27:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:32] !log arnaudb@dns1004 END - running authdns-update [13:27:52] !log arnaudb@cumin1003 START - Cookbook sre.dns.wipe-cache gerrit-replica.discovery.wmnet gerrit-spare.discovery.wmnet on all recursors [13:27:55] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gerrit-replica.discovery.wmnet gerrit-spare.discovery.wmnet on all recursors [13:29:32] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:29:46] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: moss-fe[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [13:29:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-fe[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [13:29:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:29:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moss-fe[2001-2002].codfw.wmnet [13:30:20] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bookworm [13:30:34] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:30:52] (CR) Muehlenhoff: [C:+2] puppetdb: Allow access for cloud puppetservers [puppet] - https://gerrit.wikimedia.org/r/1243113 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [13:31:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on
dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:45] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:31:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run 
[13:31:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [13:34:12] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:38:43] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie [13:40:22] (03PS1) 10Filippo Giunchedi: pontoon: always fetch project name from keystone [puppet] - 10https://gerrit.wikimedia.org/r/1243125 (https://phabricator.wikimedia.org/T418236) [13:44:53] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:45:53] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[13:49:50] slyngshede@cumin1003 reimage (PID 775282) is awaiting input [13:50:29] !log arnaudb@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [13:50:35] (03PS2) 10Filippo Giunchedi: pontoon: always fetch project name from keystone [puppet] - 10https://gerrit.wikimedia.org/r/1243125 (https://phabricator.wikimedia.org/T418236) [13:53:38] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2045.codfw.wmnet with OS trixie [13:53:53] (03PS2) 10JMeybohm: sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) [13:54:47] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [13:54:52] (03PS1) 10Michael Große: feat: if Minerva personal menu is enabled, flip discovery site notice [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243127 (https://phabricator.wikimedia.org/T416656) [13:55:10] (03PS2) 10Muehlenhoff: Reapply "Update two hooks to the variants from the puppetserver module" [puppet] - 10https://gerrit.wikimedia.org/r/1243107 (https://phabricator.wikimedia.org/T365798) [13:56:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243127 (https://phabricator.wikimedia.org/T416656) (owner: 10Michael Große) [13:58:10] (03PS3) 10JMeybohm: sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) [13:58:21] (03PS1) 10Muehlenhoff: Remove two spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1243129 [13:59:17] 06SRE, 
06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11645098 (10MoritzMuehlenhoff) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1400) [14:00:05] awight and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] Hi, I can deploy my config patch. [14:00:14] * MichaelG_WMF is here [14:00:21] awight: go ahead [14:00:24] MichaelG_WMF: and i can deploy for you [14:00:33] urbanecm: Thanks! [14:00:43] ack :-) [14:01:05] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [14:01:20] (03CR) 10CI reject: [V:04-1] feat: if Minerva personal menu is enabled, flip discovery site notice [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243127 (https://phabricator.wikimedia.org/T416656) (owner: 10Michael Große) [14:01:38] I'll have a look at the CI failure [14:01:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243047 (https://phabricator.wikimedia.org/T418209) (owner: 10Awight) [14:02:03] git error, unrelated [14:02:08] (03CR) 10Clément Goubert: sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [14:02:17] (03CR) 10Michael Große: "recheck" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243127 (https://phabricator.wikimedia.org/T416656) (owner: 
10Michael Große) [14:02:46] (03Merged) 10jenkins-bot: Subreferencing pilot wikis, phase 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243047 (https://phabricator.wikimedia.org/T418209) (owner: 10Awight) [14:03:03] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1243047|Subreferencing pilot wikis, phase 2 (T418209)]] [14:03:08] T418209: Deploy subreferencing: pilot wikis phase 2 - https://phabricator.wikimedia.org/T418209 [14:03:32] (03PS2) 10Btullis: Add the new druid-internal servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) [14:03:51] (03CR) 10CI reject: [V:04-1] sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [14:04:29] MichaelG_WMF: i have a feeling that might be permanent... [14:04:40] ...as i just had it on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1243126 two times in a row [14:05:01] !log awight@deploy2002 awight: Backport for [[gerrit:1243047|Subreferencing pilot wikis, phase 2 (T418209)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:05:05] let's see [14:05:12] (03CR) 10Btullis: [C:03+2] Move a second journalnode to a newer host [puppet] - 10https://gerrit.wikimedia.org/r/1242511 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [14:05:23] (03PS4) 10JMeybohm: sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) [14:05:26] awight: fyi i'm +2ing the backport to save CI time, will wait on handover before touching prod [14:05:29] (03CR) 10Urbanecm: [C:03+2] feat: if Minerva personal menu is enabled, flip discovery site notice [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243127 (https://phabricator.wikimedia.org/T416656) (owner: 10Michael Große) [14:06:40] (03PS2) 10Arnaudb: gerrit: prepare replication resume for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) [14:06:49] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [14:07:00] (03PS2) 10Arnaudb: gerrit: resume replication on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1242279 (https://phabricator.wikimedia.org/T417247) [14:07:10] looks good, continuing [14:07:20] urbanecm: makes sense, thanks for the note! 
[14:07:25] !log awight@deploy2002 awight: Continuing with sync [14:09:33] (03CR) 10JMeybohm: [V:03+2 C:03+2] sre.k8s.pool-depool-node: Fix type annotation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1243075 (https://phabricator.wikimedia.org/T410537) (owner: 10JMeybohm) [14:10:46] * MichaelG_WMF is right back [14:11:20] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243047|Subreferencing pilot wikis, phase 2 (T418209)]] (duration: 08m 16s) [14:11:24] T418209: Deploy subreferencing: pilot wikis phase 2 - https://phabricator.wikimedia.org/T418209 [14:12:21] (03PS3) 10Arnaudb: gerrit: prepare replication resume for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) [14:12:21] (03PS3) 10Arnaudb: gerrit: install gerrit and sync-instances [puppet] - 10https://gerrit.wikimedia.org/r/1242279 (https://phabricator.wikimedia.org/T417247) [14:12:26] urbanecm: All yours, thanks for taking on the other deployment [14:12:41] thanks! 
[14:12:44] waiting on CI [14:12:55] (03CR) 10JMeybohm: [C:03+2] sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [14:14:01] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS bookworm [14:14:26] (03PS4) 10Arnaudb: gerrit: install gerrit and sync-instances [puppet] - 10https://gerrit.wikimedia.org/r/1242279 (https://phabricator.wikimedia.org/T417247) [14:14:48] (03PS4) 10Arnaudb: gerrit: migrate gerrit2 system user to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) [14:15:02] MichaelG_WMF: we're not on testwikis, so +2ing will be enough [14:15:13] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-node: Fix type annotation [cookbooks] - 10https://gerrit.wikimedia.org/r/1243075 (https://phabricator.wikimedia.org/T410537) (owner: 10JMeybohm) [14:15:45] (03CR) 10Arnaudb: "I've added one more step to the relation chain because we were changing the gerrit role too soon the reimage process → that needed to be d" [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [14:16:12] * MichaelG_WMF is back [14:16:23] urbanecm: yes, that was my understanding as well [14:16:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11645218 (10BTullis) 05Open→03Resolved This is complete now. [14:16:45] in that case, let's wait for CI and be done with it :) [14:16:47] (03CR) 10Arnaudb: [C:03+2] gerrit: migrate gerrit2 system user to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [14:16:58] (one of simpler problems to handle for today...) 
[14:17:41] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bookworm [14:18:17] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-node: Support racks without L2 adjacency to LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/1243101 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [14:18:33] (03PS1) 10Elukey: ml-services: move revertrisk away from the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243132 [14:18:55] (03PS2) 10Dpogorzelski: ml-serve: fix istio/transparentproxy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243112 [14:19:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11645235 (10BTullis) a:05BTullis→03None [14:19:44] (03Merged) 10jenkins-bot: feat: if Minerva personal menu is enabled, flip discovery site notice [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243127 (https://phabricator.wikimedia.org/T416656) (owner: 10Michael Große) [14:19:59] (03CR) 10Dpogorzelski: [C:03+1] ml-services: move revertrisk away from the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243132 (owner: 10Elukey) [14:20:15] (03PS5) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [14:20:15] (03PS6) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [14:20:15] (03PS1) 10Tiziano Fogli: thanos/querier (TMP): filter out non local ruler from query configs [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) [14:21:59] MichaelG_WMF: so, should be done [14:22:15] Great, thanks! 
[14:22:16] (03CR) 10Jgreen: [C:03+1] Fix hostname for frmx SPF records [dns] - 10https://gerrit.wikimedia.org/r/1242532 (https://phabricator.wikimedia.org/T417958) (owner: 10Dwisehaupt) [14:22:52] (03CR) 10CI reject: [V:04-1] Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:23:47] (03CR) 10CI reject: [V:04-1] Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:23:50] (03CR) 10CI reject: [V:04-1] thanos/querier (TMP): filter out non local ruler from query configs [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:23:52] (03PS1) 10Muehlenhoff: thumbor-plugins: Stop using pkg_resources [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1243135 [14:25:52] (03PS2) 10Tiziano Fogli: thanos/querier (TMP): filter out non local ruler from query configs [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) [14:25:52] (03PS6) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [14:25:52] (03PS7) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [14:26:55] (03PS2) 10Elukey: ml-services: move away from the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243132 [14:27:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243107 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:28:21] (03CR) 10CI reject: [V:04-1] thanos/querier (TMP): filter out non local ruler from query configs [puppet] - 
10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:29:02] (03CR) 10CI reject: [V:04-1] thumbor-plugins: Stop using pkg_resources [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1243135 (owner: 10Muehlenhoff) [14:29:08] (03CR) 10CI reject: [V:04-1] Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:29:09] (03CR) 10Muehlenhoff: [C:03+2] syslog::remote: Remove buster workarounds [puppet] - 10https://gerrit.wikimedia.org/r/1243069 (owner: 10Muehlenhoff) [14:29:29] (03CR) 10CI reject: [V:04-1] Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [14:29:38] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie [14:29:38] (03CR) 10Dpogorzelski: [C:03+1] ml-services: move away from the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243132 (owner: 10Elukey) [14:31:40] (03Abandoned) 10Elukey: ml-services: move away from the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243132 (owner: 10Elukey) [14:33:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:16] (03PS1) 10Muehlenhoff: wmflib::service::probe::tcp_module_options: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243137 [14:34:27] (03PS1) 10Elukey: ml-services: force Revert Risk to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243138 [14:34:30] !log 
slyngshede@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2045.codfw.wmnet with OS trixie [14:36:34] (03CR) 10CI reject: [V:04-1] wmflib::service::probe::tcp_module_options: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243137 (owner: 10Muehlenhoff) [14:37:05] (03PS2) 10Federico Ceratto: site.pp: Setup dborch1003 [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) [14:37:30] !log arnaudb@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [14:38:26] (03CR) 10Muehlenhoff: [C:03+2] mariadb::packages_client: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219874 (owner: 10Muehlenhoff) [14:43:55] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [14:43:56] (03CR) 10Dpogorzelski: [C:03+1] ml-services: force Revert Risk to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243138 (owner: 10Elukey) [14:46:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:48:49] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:49:47] (03CR) 10AikoChou: [C:03+1] ml-services: force Revert Risk to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243138 (owner: 10Elukey) [14:50:16] (03CR) 10Muehlenhoff: [C:03+2] docker: Remove check for memory_cgroup 
[puppet] - 10https://gerrit.wikimedia.org/r/1223184 (owner: 10Muehlenhoff) [14:50:32] (03CR) 10Dpogorzelski: [C:03+2] ml-services: force Revert Risk to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243138 (owner: 10Elukey) [14:51:47] 06SRE, 06Infrastructure-Foundations, 10Mail: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941#11645548 (10Jgreen) >>! In T417941#11636764, @Dzahn wrote: > @Jgreen I removed the dmarc@donate.wikimedia.org line from that alias. > > I... [14:52:32] (03Merged) 10jenkins-bot: ml-services: force Revert Risk to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243138 (owner: 10Elukey) [14:53:01] (03CR) 10Federico Ceratto: "I'm getting an error in the automatically started CI test named "test" due to... missing jpg images it seems." [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [14:53:15] (03CR) 10Federico Ceratto: [V:03+2] site.pp: Setup dborch1003 [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [14:53:25] (03CR) 10Federico Ceratto: site.pp: Setup dborch1003 [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [14:53:33] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:54:03] (03PS2) 10Fabfur: hiera: test haproxy 3.0 on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1242427 [14:54:07] (03PS2) 10Muehlenhoff: wmflib::service::probe::tcp_module_options: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243137 [14:56:17] (03CR) 10CI reject: [V:04-1] wmflib::service::probe::tcp_module_options: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243137 (owner: 10Muehlenhoff) [14:56:23] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1242427 (owner: 10Fabfur) [14:57:24] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11645599 (10BTullis) Just a data point. We're still seeing an ever-increasing value for these open soc... [14:58:04] (03CR) 10Btullis: [C:03+2] Prepare to decom the old an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [14:58:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:58:38] 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11645606 (10Clement_Goubert) 05Open→03Declined Abandoning as I think these are the hosts we got in the last refresh. [14:58:41] (03CR) 10Btullis: [C:03+2] Add the configuration for the new dse-k8s worker nodes that were an-worker [puppet] - 10https://gerrit.wikimedia.org/r/1242514 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [14:59:05] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1500) [15:00:10] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243140 [15:01:13] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS bookworm [15:01:31] (03CR) 10Arnaudb: [C:03+2] gerrit: install gerrit and sync-instances [puppet] - 10https://gerrit.wikimedia.org/r/1242279 (https://phabricator.wikimedia.org/T417247) (owner: 10Arnaudb) [15:02:21] (03CR) 10AikoChou: [C:03+1] ml-serve: fix istio/transparentproxy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243112 (owner: 10Dpogorzelski) [15:02:37] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:02:56] (03CR) 10Ayounsi: [C:03+1] wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:08:11] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.sync-instances sync Gerrit data from gerrit2003.wikimedia.org to gerrit2002.wikimedia.org [15:09:39] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: fix istio/transparentproxy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243112 (owner: 10Dpogorzelski) [15:11:14] !log arnaudb@cumin1003 END (ERROR) - Cookbook sre.gerrit.sync-instances (exit_code=97) sync Gerrit data from gerrit2003.wikimedia.org to gerrit2002.wikimedia.org [15:11:58] (03CR) 10Federico Ceratto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [15:12:09] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.sync-instances sync Gerrit data from gerrit2003.wikimedia.org to gerrit2002.wikimedia.org [15:15:28] !log arnaudb@cumin1003 END (FAIL) - Cookbook 
sre.gerrit.sync-instances (exit_code=99) sync Gerrit data from gerrit2003.wikimedia.org to gerrit2002.wikimedia.org [15:16:35] (03PS1) 10Dpogorzelski: ml-services: force articletopic to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243144 [15:17:14] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.sync-instances sync Gerrit data from gerrit2003.wikimedia.org to gerrit2002.wikimedia.org [15:18:00] (03CR) 10Elukey: [C:03+1] ml-services: force articletopic to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243144 (owner: 10Dpogorzelski) [15:18:45] (03CR) 10Dpogorzelski: [C:03+2] ml-services: force articletopic to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243144 (owner: 10Dpogorzelski) [15:19:37] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:20:33] (03PS2) 10Arnaudb: gerrit: resume replication on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1243131 (https://phabricator.wikimedia.org/T417247) [15:22:16] (03CR) 10Herron: [C:03+1] Remove OS check for nrpe2nodexp [puppet] - 10https://gerrit.wikimedia.org/r/1243068 (owner: 10Muehlenhoff) [15:22:47] (03CR) 10Herron: [C:03+1] mtail: Use the Debian version of mtail universally [puppet] - 10https://gerrit.wikimedia.org/r/1243048 (owner: 10Muehlenhoff) [15:23:23] (03CR) 10Herron: [C:03+1] meta-monitoring: add rewrite rule to redirect home to Wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1241014 (https://phabricator.wikimedia.org/T417900) (owner: 10Tiziano Fogli) [15:27:59] (03PS5) 10Brouberol: Use importlib.metadata instead of pkg_resources, now deprecated/removed. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1240850 [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1530) [15:30:30] (03PS3) 10Ayounsi: Nokia: add local-as to k8s BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1242410 (https://phabricator.wikimedia.org/T417817) [15:31:58] (03CR) 10Ayounsi: [C:03+2] Nokia: add local-as to k8s BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1242410 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [15:33:17] (03Merged) 10jenkins-bot: Nokia: add local-as to k8s BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1242410 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [15:35:25] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T418256 [15:35:30] T418256: Deploy Phab/Phorge 2026-02-24 - https://phabricator.wikimedia.org/T418256 [15:37:11] (03CR) 10Brouberol: [C:03+1] Add the new druid-internal servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: 10Btullis) [15:38:10] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker[1119-1130,1135-1141].eqiad.wmnet [15:40:18] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie [15:42:23] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [15:42:38] !log dwisehaupt@dns1004 START - running authdns-update [15:43:16] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [15:44:02] !log dwisehaupt@dns1004 END - running authdns-update [15:44:23] !log Remove Phabricator MFA for EMcFarland-WMF (T418260) [15:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:44:28] T418260: Reset MFA for EMcFarland-WMF on Phabricator - https://phabricator.wikimedia.org/T418260 [15:46:01] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [15:46:32] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [15:46:56] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11646285 (10Dzahn) [15:47:08] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Product Safety and Integrity, and 3 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947#11646291 (10Dreamy_Jazz) [15:47:30] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11646296 (10Dzahn) [15:48:06] (03CR) 10Dwisehaupt: [C:03+2] Fix hostname for frmx SPF records [dns] - 10https://gerrit.wikimedia.org/r/1242532 (https://phabricator.wikimedia.org/T417958) (owner: 10Dwisehaupt) [15:48:38] !log sukhe@dns1004 START - running authdns-update [15:48:56] !log enable IPv6 glue records for ns[02].wikimedia.org: T81605 [15:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:00] T81605: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 [15:49:21] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.sync-instances (exit_code=0) sync Gerrit data from gerrit2003.wikimedia.org to gerrit2002.wikimedia.org [15:50:18] !log sukhe@dns1004 END - running authdns-update [15:51:24] (03CR) 10JHathaway: [C:03+1] apt: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243035 (owner: 10Muehlenhoff) [15:52:18] (03CR) 10JHathaway: [C:03+1] Reapply "Update two hooks to the 
variants from the puppetserver module" [puppet] - 10https://gerrit.wikimedia.org/r/1243107 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:52:43] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [15:52:45] (03CR) 10JHathaway: [C:03+1] Remove create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/1243090 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:52:50] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2045.codfw.wmnet with OS trixie [15:53:08] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo [15:53:21] (03CR) 10JHathaway: [C:03+1] Remove various Hiera files only necessary for Puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/1243087 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:54:03] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11646397 (10Dzahn) [15:54:19] jouncebot: nowandnext [15:54:19] For the next 0 hour(s) and 5 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1530) [15:54:19] In 0 hour(s) and 5 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1600) [15:54:29] (03PS1) 10Dpogorzelski: ml-services: force revertrisk-multi to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243149 [15:54:33] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11646421 (10Dzahn) [15:54:49] 06SRE, 10SRE-Access-Requests, 
06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11646432 (10Dzahn) Thanks. Most things here are done. The SSH key needs to be verifi... [15:55:04] 10SRE-SLO, 06Abstract Wikipedia team, 06serviceops, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11646436 (10Jdforrester-WMF) Over the past 24 hours it's now dropped from 12% to 0.1% and will likely... [15:55:16] (03CR) 10Ssingh: [C:03+2] wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:55:20] (03PS1) 10Volans: wmcs: infra-tracing-nfs improve requests failures [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) [15:55:38] (03CR) 10Arnaudb: [C:03+2] gerrit: resume replication on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1243131 (https://phabricator.wikimedia.org/T417247) (owner: 10Arnaudb) [15:55:44] !log dzahn@cumin2002 START - Cookbook sre.gerrit.restart-gerrit Restarting Gerrit on gerrit2003 [15:55:50] (03CR) 10CI reject: [V:04-1] wmcs: infra-tracing-nfs improve requests failures [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [15:56:01] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gerrit.restart-gerrit (exit_code=99) Restarting Gerrit on gerrit2003 [15:56:14] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11646452 (10Rsilvola) Thank you! At the moment, I only expect to do occasional depl... 
[15:56:42] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T416726#11646457 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @JMeybohm disk has been replaced. [15:57:43] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:58:46] FIRING: [6x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [15:58:57] FIRING: GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in eqiad - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [15:59:50] !log bking@local restarting wdqs codfw main to deal with 5xx errors [15:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jelto, arnoldokoth, mutante, and arnaudb: OwO what's this, a deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1600). nyaa~ [16:00:07] !log gerrit2003 was restarted for maintenance reasons - expecting recovery soon [16:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:18] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[16:01:13] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:01:27] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, and 2 others: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11646575 (10elukey) Matthew upgraded apus to the latest Reef patch (thanks!) and I tried today to push some Docker images to the new /test prefix: ` elukey@... [16:02:16] !log brennen@deploy2002 Started deploy [phabricator/deployment@aad109e]: deploy phab2002 for T418256 [16:02:20] T418256: Deploy Phab/Phorge 2026-02-24 - https://phabricator.wikimedia.org/T418256 [16:03:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:03:44] !log brennen@deploy2002 Finished deploy [phabricator/deployment@aad109e]: deploy phab2002 for T418256 (duration: 01m 28s) [16:03:46] RESOLVED: [13x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [16:03:57] RESOLVED: [4x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [16:04:04] !log brennen@deploy2002 Started deploy [phabricator/deployment@aad109e]: deploy phab1004 for T418256 [16:04:13] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [16:05:13] (03PS2) 10Muehlenhoff: ferm: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/1243045 [16:05:16] (03PS2) 10Volans: wmcs: infra-tracing-nfs improve requests failures [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) [16:05:32] (03CR) 10Muehlenhoff: ferm: Remove obsolete OS check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243045 (owner: 10Muehlenhoff) [16:05:59] !log brennen@deploy2002 Finished deploy [phabricator/deployment@aad109e]: deploy phab1004 for T418256 (duration: 01m 55s) [16:07:41] (03CR) 10CI reject: [V:04-1] ferm: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/1243045 (owner: 10Muehlenhoff) [16:07:43] !log brennen@deploy2002 Started deploy [phabricator/deployment@01119c5]: re-deploy phab2002 for T418256 (for real this time) [16:07:47] T418256: Deploy Phab/Phorge 2026-02-24 - https://phabricator.wikimedia.org/T418256 [16:08:14] (03CR) 10CI reject: [V:04-1] wmcs: infra-tracing-nfs improve requests failures [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [16:08:14] !log brennen@deploy2002 Finished deploy [phabricator/deployment@01119c5]: re-deploy phab2002 for T418256 (for real this time) (duration: 00m 31s) [16:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:37] !log brennen@deploy2002 Started deploy [phabricator/deployment@01119c5]: re-deploy phab1004 for T418256 (for real this time) [16:09:22] (03CR) 10Hashar: gerrit: alert for broken replication (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) (owner: 10Arnaudb) [16:09:39] !log 
brennen@deploy2002 Finished deploy [phabricator/deployment@01119c5]: re-deploy phab1004 for T418256 (for real this time) (duration: 01m 01s) [16:10:38] (03PS1) 10Muehlenhoff: Remove now obsolete spec test [puppet] - 10https://gerrit.wikimedia.org/r/1243166 [16:11:12] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:12:28] (03PS1) 10Arnaudb: gerrit: fix gerrit_proxy_spec [puppet] - 10https://gerrit.wikimedia.org/r/1243168 [16:13:01] (03CR) 10Hashar: [C:03+1] "😊" [puppet] - 10https://gerrit.wikimedia.org/r/1243168 (owner: 10Arnaudb) [16:14:12] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:14:26] (03CR) 10Arnaudb: [C:03+2] gerrit: fix gerrit_proxy_spec [puppet] - 10https://gerrit.wikimedia.org/r/1243168 (owner: 10Arnaudb) [16:15:01] (03CR) 10Tiziano Fogli: [C:03+2] meta-monitoring: add rewrite rule to redirect home to Wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1241014 (https://phabricator.wikimedia.org/T417900) (owner: 10Tiziano Fogli) [16:15:09] (03PS3) 10Volans: wmcs: infra-tracing-nfs improve requests failures [puppet] - 10https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) [16:15:24] (03CR) 10Tiziano Fogli: [C:03+2] Remove OS check for nrpe2nodexp [puppet] - 10https://gerrit.wikimedia.org/r/1243068 (owner: 10Muehlenhoff) [16:18:26] (03PS4) 10Dzahn: gerrit: remove code for having multiple daemon users [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) [16:20:08] (03PS5) 10Dzahn: gerrit: remove code for having multiple daemon users [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) [16:21:35] (03CR) 10Dzahn: [C:03+1] "following-up with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242467" [puppet] - 10https://gerrit.wikimedia.org/r/1243168 
(owner: 10Arnaudb) [16:22:30] (03CR) 10Elukey: [C:03+1] ml-services: force revertrisk-multi to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243149 (owner: 10Dpogorzelski) [16:25:17] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1242467/8132/" [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [16:27:21] (03CR) 10Ladsgroup: [C:03+1] "The resulting ferm config file is a bit different but makes sense (and might be even even faster given no DNS resolve?): https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/1242430 (owner: 10Muehlenhoff) [16:27:29] 06SRE, 06Infrastructure-Foundations, 10Mail: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941#11646872 (10jhathaway) >>! In T417941#11636764, @Dzahn wrote: > @Jgreen I removed the dmarc@donate.wikimedia.org line from that alias. >... [16:30:40] 06SRE, 06Data-Platform-SRE, 10LDAP-Access-Requests: Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270 (10AKhatun_WMF) 03NEW [16:31:26] btullis@cumin1003 decommission (PID 897548) is awaiting input [16:32:34] (03CR) 10Hashar: [C:03+1] "Excellent! That is a NOOP! :tada:" [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [16:32:53] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, and 2 others: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11646967 (10elukey) Tried also with another big image: ` elukey@build2001:~$ sudo docker push docker-registry.discovery.wmnet/test/amd-gpu-tester:latest The... 
[16:33:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:02] (03CR) 10Btullis: [C:03+2] Add the new druid-internal servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: 10Btullis) [16:35:28] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647007 (10brouberol) [16:35:34] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647009 (10brouberol) 05Open→03In progress [16:35:45] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647010 (10brouberol) a:03brouberol [16:36:16] (03CR) 10AikoChou: [C:03+1] ml-services: force revertrisk-multi to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243149 (owner: 10Dpogorzelski) [16:36:40] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11647019 (10MatthewVernon) I spent quite a bit of time with codesearch last quarter trying to track down thumbnail size (ab)use, but... 
[16:37:22] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647035 (10brouberol) @AKhatun_WMF Your username is now listed under https://ldap.toolforge.org/group/airflow-analytics-ops. Go to https... [16:39:25] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647044 (10AKhatun_WMF) Yas! I now have admin access! Thanks. [16:39:47] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647045 (10brouberol) Nice! [16:39:53] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to airflow-analytics-ops for akhatun - https://phabricator.wikimedia.org/T418270#11647047 (10brouberol) 05In progress→03Resolved [16:42:17] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:43:44] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:46:53] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2002), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:47:18] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 
5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11647087 (10Tacsipacsi) >>! In T414805#11640558, @Ladsgroup wrote: > The point is that in order to be cached, it need to have a miss... [16:49:25] btullis@cumin1003 decommission (PID 897548) is awaiting input [16:50:18] (03PS1) 10JHathaway: dmarc: remove unused ruf tags [dns] - 10https://gerrit.wikimedia.org/r/1243174 [16:50:26] (03CR) 10Elukey: locking: Add a mechanism for a global Spicerack lock. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [16:50:37] (03CR) 10Dpogorzelski: [C:03+2] ml-services: force revertrisk-multi to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243149 (owner: 10Dpogorzelski) [16:50:41] (03PS2) 10JHathaway: dmarc: remove unused ruf tags [dns] - 10https://gerrit.wikimedia.org/r/1243174 (https://phabricator.wikimedia.org/T417941) [16:50:54] (03CR) 10MVernon: [C:03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1242430 (owner: 10Muehlenhoff) [16:51:45] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1119-1130,1135-1141].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [16:52:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1119-1130,1135-1141].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [16:52:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:52:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker[1119-1130,1135-1141].eqiad.wmnet [16:52:45] (03Merged) 10jenkins-bot: ml-services: force revertrisk-multi to skip the transparent proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243149 (owner: 10Dpogorzelski) [16:54:09] (03PS1) 10Elukey: .wmfconfig: remove Buster [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1243175 [16:55:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T415786)', diff saved to https://phabricator.wikimedia.org/P89010 and previous config saved to /var/cache/conftool/dbconfig/20260224-165542-marostegui.json [16:55:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:57:43] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1700). [17:00:05] A_smart_kitten: A patch you scheduled for Puppet request window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:43] here :) [17:00:48] A_smart_kitten: o/ looking [17:02:27] for my patch, on the subject of testing it, i guess a way to test it post-deployment might be to manually trigger alerts for those components, and see if the task(s) then get filed with the right tags? (if manually triggering alerts like that is possible, that is) [17:04:23] A_smart_kitten: so, in serviceops we're in the middle of redoing our phab workflow (cc matthieulec) -- I want to run the proposal by the team before +2ing, just for social reasons not technical ones :) [17:05:37] sorry for the extra delay, I know it's frustrating especially because I see you got a positive reply on the task already, I just want to make sure we get a chance to discuss [17:06:59] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11647253 (10Dzahn) @Rsilvola Gotcha! We just need to verify it's really you and your... [17:06:59] rzl: sounds okay to me, but thanks for acknowledging the situation re the positive reply on the task :) [I probably assumed it represented the okay from serviceops generally] [17:07:02] do you want me to reschedule in the future for another puppet request window, or should I leave it to serviceops to deploy as/when? [17:08:00] good question -- you can consider this handed off, if the team is happy with it I'll merge it async and no need for another window [17:08:30] rzl: ty, will leave it with you :) [17:08:52] if you don't hear back in, let's say a week, please do ping me directly [17:09:11] will do (probably on the task) [17:09:18] sgtm! 
[17:10:09] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, and 2 others: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11647274 (10MatthewVernon) At least so far, no issues with sync getting far behind either. [17:10:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P89011 and previous config saved to /var/cache/conftool/dbconfig/20260224-171051-marostegui.json [17:10:52] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11647276 (10Dzahn) a:03thcipriani [17:12:23] (03CR) 10Dillon: [C:03+1] Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [17:13:30] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11647285 (10Dzahn) What is the goal to be achieved here? 
[17:15:06] (03PS1) 10BCornwall: haproxy: Conditionally set cpu-map when >1 CPU [puppet] - 10https://gerrit.wikimedia.org/r/1243180 (https://phabricator.wikimedia.org/T418182) [17:15:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06), 13Patch-For-Review: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11647290 (10BTullis) a:05BTullis→03None [17:16:36] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8133/co" [puppet] - 10https://gerrit.wikimedia.org/r/1243180 (https://phabricator.wikimedia.org/T418182) (owner: 10BCornwall) [17:17:14] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1117 to dse-k8s-worker1024 [17:17:35] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [17:20:39] (03CR) 10Fabfur: [C:03+1] haproxy: Conditionally set cpu-map when >1 CPU [puppet] - 10https://gerrit.wikimedia.org/r/1243180 (https://phabricator.wikimedia.org/T418182) (owner: 10BCornwall) [17:21:12] (03CR) 10Dzahn: [V:03+1 C:03+2] gerrit: remove code for having multiple daemon users [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:21:51] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:22:30] (03CR) 10BCornwall: [V:03+1 C:03+2] haproxy: Conditionally set cpu-map when >1 CPU [puppet] - 10https://gerrit.wikimedia.org/r/1243180 (https://phabricator.wikimedia.org/T418182) (owner: 10BCornwall) [17:22:45] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:23:18] btullis@cumin1003 rename (PID 1001646) is awaiting input [17:26:00] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P89012 and previous config saved to /var/cache/conftool/dbconfig/20260224-172559-marostegui.json [17:28:32] (03CR) 10Federico Ceratto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [17:28:55] (03CR) 10Federico Ceratto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [17:29:22] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:35:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:35:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:35:33] (03PS1) 10Dzahn: backup: adjust gerrit file set after renaming of gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) [17:36:04] (03PS2) 10Dzahn: backup: adjust gerrit file set after renaming of gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) [17:36:48] (03PS1) 10BCornwall: ats: Set secondary nvme drives for new codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1243184 (https://phabricator.wikimedia.org/T401832) [17:39:00] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1243184 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [17:39:00] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1117 to dse-k8s-worker1024 - btullis@cumin1003" [17:41:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T415786)', diff saved to https://phabricator.wikimedia.org/P89013 and previous config saved to /var/cache/conftool/dbconfig/20260224-174107-marostegui.json [17:41:12] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:41:25] (03PS1) 10Dzahn: gerrit: cleanup Hiera and tests after gerrit2 renaming [puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) [17:41:26] (03CR) 10BCornwall: [C:03+1] dmarc: remove unused ruf tags [dns] - 10https://gerrit.wikimedia.org/r/1243174 (https://phabricator.wikimedia.org/T417941) (owner: 10JHathaway) [17:41:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1117 to dse-k8s-worker1024 - btullis@cumin1003" [17:41:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:46] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1024 on all recursors [17:41:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1024 on all recursors [17:41:50] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1024 [17:42:23] (03PS1) 10Dzahn: admin: rename gerrit system user [puppet] - 10https://gerrit.wikimedia.org/r/1243188 (https://phabricator.wikimedia.org/T338470) 
[17:43:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1024 [17:43:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1117 to dse-k8s-worker1024 [17:46:51] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:51:00] (03CR) 10A smart kitten: "[FTR, current status as at T417020#11647567]" [puppet] - 10https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: 10A smart kitten) [17:52:42] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11647570 (10thcipriani) Reason for access makes sense: approved for `deployment` group. @EMcFarland-WMF to deploy backports you'll also need to request `spiderpig-access` on https://i... [17:56:42] 06SRE, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941#11647590 (10Jgreen) 05Open→03Resolved a:03Jgreen >>! In T417941#11646872, @jhathaway wrote: >>>! In T41794... [17:57:17] RESOLVED: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:11] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11647596 (10ajhalili2006) >>! In T418068#11647285, @Dzahn wrote: > What is the goal to be achieved here? Since I manually renamed my Wikimedia developer account in... 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1800) [18:00:27] (03PS5) 10Urbanecm: [Growth] Enable on all open Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239949 (https://phabricator.wikimedia.org/T417023) [18:00:52] (03CR) 10Urbanecm: "rebase done by ignoring conflicts and using `composer manage-dblist update` to re-generate dblists-index.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239949 (https://phabricator.wikimedia.org/T417023) (owner: 10Urbanecm) [18:03:06] (03PS1) 10Urbanecm: feat(DataProvider): Allow logging of read validation failures [extensions/CommunityConfiguration] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243190 (https://phabricator.wikimedia.org/T417893) [18:07:50] (03CR) 10Hashar: "I assume the previously backed up `/var/lib/gerrit2` will remain present in the backup system and this will only apply to the future backu" [puppet] - 10https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) (owner: 10Dzahn) [18:11:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11647680 (10wiki_willy) Hi @BTullis - sure, that sounds like a good test plan. One thing to keep in mind thou... [18:17:56] (03PS2) 10Cwhite: admin: add keys for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1242411 [18:18:24] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker[1118,1131,1133-1134].eqiad.wmnet [18:28:04] (03CR) 10Dzahn: "I don't understand why removing some brackets limits it to eqiad and codfw but the intention sounds good." 
[alerts] - 10https://gerrit.wikimedia.org/r/1243102 (https://phabricator.wikimedia.org/T418084) (owner: 10Arnaudb) [18:29:45] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:30:01] (03CR) 10Dzahn: "Yea, it will affect what will be backed up next time. But also we generally can't expect anything to remain in the backup system for truly" [puppet] - 10https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) (owner: 10Dzahn) [18:32:31] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11647766 (10Dzahn) That seems the same thing as just setting a random password and not logging in anymore? [18:34:07] 06SRE, 10Infrastructure Security, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11647782 (10Dzahn) [18:34:35] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11647788 (10A_smart_kitten) >>! In T418068#11647766, @Dzahn wrote: > Not sure if another type of "lock... [18:34:56] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1118,1131,1133-1134].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [18:36:26] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11647792 (10Dzahn) I am not sure if users who are simply not active need to be banned. Leaving that to... 
[18:37:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1118,1131,1133-1134].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [18:37:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker[1118,1131,1133-1134].eqiad.wmnet [18:40:15] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts dse-k8s-worker1024.eqiad.wmnet [18:42:28] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:43:02] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:43:02] PROBLEM - Confd vcl based reload on cp2033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:43:06] uh? [18:43:24] brett: ^ anything to know about? [18:45:18] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:46:20] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11647903 (10ssingh) 05Open→03Resolved a:03ssingh With the rollout of ns[02] IPv6 glue records today, we have IPv6 support on all ns[0-2].wikimedia.org. There is some more work here: we h... 
[18:47:33] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11647908 (10Dzahn) [18:47:37] (03PS5) 10Ssingh: P:bird::anycast: automatically detect IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) [18:47:42] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11647909 (10Dzahn) Thanks! SSH key confirmed out-of-band. [18:48:17] (03PS1) 10BCornwall: varnishkafka: Only enable prom exporter for text [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) [18:49:58] (03CR) 10Ssingh: [V:03+1] "[Still looking for a review" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:50:38] (03PS1) 10Dzahn: admin: add rsilvola to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1243196 (https://phabricator.wikimedia.org/T418004) [18:50:46] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [18:52:06] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11647948 (10Papaul) a:05Papaul→03ayounsi [18:52:37] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8137/co" [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [18:53:07] 
06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11647956 (10Dzahn) [18:53:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11647957 (10Dzahn) [18:53:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1024.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [18:53:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:32] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dse-k8s-worker1024.eqiad.wmnet [18:56:27] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [19:00:04] dduvall and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1900). [19:05:29] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:41] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [19:05:52] dancy: pretrain failed due to an unclean `/srv/patches` but it seems fine now. 
i'll roll testwikis and then group0 shortly after [19:06:38] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11648006 (10Jgreen) 05Open→03In progress p:05Triage→03Medium a:03Jgreen [19:07:00] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11648009 (10Jgreen) [19:07:03] dduvall: sounds good [19:08:02] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [19:08:02] RECOVERY - Confd vcl based reload on cp2033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [19:08:14] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243197 (https://phabricator.wikimedia.org/T413808) [19:08:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243197 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [19:08:30] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [19:09:52] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating records after renaming and moving vlan of some an-worker hosts - btullis@cumin1003" [19:09:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating records after renaming and moving vlan of some an-worker hosts - btullis@cumin1003" [19:09:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:02] PROBLEM - Confd vcl based reload on cp2031 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [19:10:10] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243197 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [19:10:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:11:37] (03PS1) 10BCornwall: kafka::webrequest: Only use varnishkafka when on [puppet] - 10https://gerrit.wikimedia.org/r/1243199 [19:11:37] (03PS1) 10BCornwall: kafka::webrequest: Tighten monitoring guard [puppet] - 10https://gerrit.wikimedia.org/r/1243200 [19:12:09] !log dduvall@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.17 refs T413808 [19:12:13] T413808: 1.46.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T413808 [19:13:50] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1024 [19:13:56] (03CR) 10CI reject: [V:04-1] kafka::webrequest: Only use varnishkafka when on [puppet] - 10https://gerrit.wikimedia.org/r/1243199 (owner: 10BCornwall) [19:14:15] !log btullis@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host dse-k8s-worker1024 [19:14:21] (03CR) 10CI reject: [V:04-1] kafka::webrequest: Tighten monitoring guard [puppet] - 10https://gerrit.wikimedia.org/r/1243200 (owner: 10BCornwall) [19:14:38] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1024 [19:15:50] (03PS2) 10BCornwall: kafka::webrequest: Tighten monitoring guard [puppet] - 10https://gerrit.wikimedia.org/r/1243200 [19:15:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1024 [19:15:59] !log btullis@cumin1003 START - 
Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1025 [19:17:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1025 [19:18:14] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1026 [19:18:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1026 [19:18:39] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1027 [19:18:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1027 [19:18:57] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1028 [19:19:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1028 [19:20:11] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:20:56] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8138/console" [puppet] - 10https://gerrit.wikimedia.org/r/1243200 (owner: 10BCornwall) [19:25:08] btullis@cumin1003 provision (PID 1103540) is awaiting input [19:25:37] (03CR) 10Dzahn: "ACK! I made https://phabricator.wikimedia.org/T418299 just now to track that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:25:57] (03CR) 10Dzahn: [C:03+2] releases: upgrade Java version from 17 to 21 [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:26:04] (03PS4) 10Dzahn: releases: upgrade Java version from 17 to 21 [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) [19:26:52] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:27:31] (03PS2) 10BCornwall: varnishkafka: Only enable prom exporter for text [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) [19:27:56] (03CR) 10Dzahn: [C:03+2] releases: upgrade Java version from 17 to 21 [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:28:18] (03PS3) 10BCornwall: varnishkafka: Only enable for text [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) [19:29:41] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8139/co" [puppet] - 10https://gerrit.wikimedia.org/r/1243195 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [19:30:41] (03CR) 10Dzahn: [C:03+2] "and .. it failed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:32:18] (03CR) 10Dzahn: [C:03+2] "E: The list of sources could not be read" [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:33:04] (03Abandoned) 10BCornwall: kafka::webrequest: Only use varnishkafka when on [puppet] - 10https://gerrit.wikimedia.org/r/1243199 (owner: 10BCornwall) [19:33:20] (03CR) 10Dzahn: [C:03+2] "E: Conflicting values set for option Signed-By regarding source http://apt.wikimedia.org/wikimedia/ bookworm-wikimedia: /etc/apt/keyrings/" [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:33:30] (03PS1) 10Joal: Extend webrequest and other data retention [puppet] - 10https://gerrit.wikimedia.org/r/1243205 (https://phabricator.wikimedia.org/T418162) [19:34:15] (03PS1) 10Dzahn: Revert "releases: upgrade Java version from 17 to 21" [puppet] - 10https://gerrit.wikimedia.org/r/1243206 [19:34:29] (03CR) 10Dzahn: [C:03+2] Revert "releases: upgrade Java version from 17 to 21" [puppet] - 10https://gerrit.wikimedia.org/r/1243206 (owner: 10Dzahn) [19:38:45] (03CR) 10JHathaway: [C:03+2] dmarc: remove unused ruf tags [dns] - 10https://gerrit.wikimedia.org/r/1243174 (https://phabricator.wikimedia.org/T417941) (owner: 10JHathaway) [19:39:54] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11648202 (10Jgreen) [19:40:02] !log jhathaway@dns1004 START - running authdns-update [19:41:28] !log jhathaway@dns1004 END - running authdns-update [19:43:11] (03CR) 10JavierMonton: [C:03+1] Extend webrequest and other data retention [puppet] - 10https://gerrit.wikimedia.org/r/1243205 (https://phabricator.wikimedia.org/T418162) (owner: 10Joal) [19:56:48] !log dduvall@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.17 refs T413808 (duration: 
44m 39s) [19:56:52] T413808: 1.46.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T413808 [19:58:57] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243211 (https://phabricator.wikimedia.org/T413808) [19:58:59] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243211 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [20:00:00] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243211 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [20:08:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:08:30] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.17 refs T413808 [20:08:34] T413808: 1.46.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T413808 [20:19:59] (03CR) 10Btullis: [C:03+2] Extend webrequest and other data retention [puppet] - 10https://gerrit.wikimedia.org/r/1243205 (https://phabricator.wikimedia.org/T418162) (owner: 10Joal) [20:23:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:23:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:36:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:39:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:42:18] (03CR) 10RLazarus: [C:03+1] "LGTM as discussed offline -- this already should've been like this, and I agree it looks like a no-op for non-drain-related cases like tra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:42:43] done with train [20:48:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:50:50] (03CR) 10RLazarus: [C:03+1] "This is starting to feel like it's straining the boundaries of what's reasonable to do in bash before rewriting it in Python. The changes " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:53:48] (03PS1) 10Aqu: Bump Blunderbuss image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243221 (https://phabricator.wikimedia.org/T415874) [20:55:44] (03PS2) 10Aqu: Bump Blunderbuss image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243221 (https://phabricator.wikimedia.org/T415874) [20:55:44] (03CR) 10RLazarus: [C:03+1] "This doesn't need to touch package.json as it's a patch version only, but `sextant update` would update the version numbers in package.loc" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T2100). [21:00:05] AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:49] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [21:05:51] (03PS4) 10Pppery: Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [21:06:37] (03CR) 10CI reject: [V:04-1] Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [21:07:25] (03PS5) 10Pppery: Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [21:09:50] guess it's just me [21:11:15] (03PS27) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:11:54] (03CR) 10CI reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:14:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224253 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [21:14:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224228 
(https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [21:15:52] (03Merged) 10jenkins-bot: Switch math sandbox specs to plain wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224253 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [21:16:04] (03Merged) 10jenkins-bot: Copy rest_v1-wikimedia.json to standard-docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224228 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [21:16:23] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1224253|Switch math sandbox specs to plain wikimedia.org (T418188)]], [[gerrit:1224228|Copy rest_v1-wikimedia.json to standard-docroot (T418188)]] [21:16:28] T418188: Simplify static Restbase json spec file configuration - https://phabricator.wikimedia.org/T418188 [21:17:48] Hey folks. We've noticed the mirror of ubuntu looks about 22 days behind. Does that seem correct? I am basing this on "curl -sI "https://mirrors.wikimedia.org/ubuntu/dists/jammy-updates/Release" | grep -i last-modified" [21:19:02] (03PS28) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:19:03] !log aaron@deploy2002 aaron: Backport for [[gerrit:1224253|Switch math sandbox specs to plain wikimedia.org (T418188)]], [[gerrit:1224228|Copy rest_v1-wikimedia.json to standard-docroot (T418188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:19:45] !log aaron@deploy2002 aaron: Continuing with sync [21:22:26] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11648602 (10AKanji-WMF) [21:23:43] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224253|Switch math sandbox specs to plain wikimedia.org (T418188)]], [[gerrit:1224228|Copy rest_v1-wikimedia.json to standard-docroot (T418188)]] (duration: 07m 20s) [21:23:47] T418188: Simplify static Restbase json spec file configuration - https://phabricator.wikimedia.org/T418188 [21:24:02] done [21:28:00] (03PS1) 10JHathaway: dmarc: add dmarc records for domains which do not send email [dns] - 10https://gerrit.wikimedia.org/r/1243225 [21:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:37:05] (03CR) 10BCornwall: [C:03+1] dmarc: add dmarc records for domains which do not send email [dns] - 10https://gerrit.wikimedia.org/r/1243225 (owner: 10JHathaway) [21:40:18] (03PS29) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:41:19] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [21:42:05] RECOVERY - Confd vcl based reload on cp2031 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [21:42:24] (03CR) 10CI reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:45:55] (03PS30) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:48:01] (03CR) 10CI reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:53:24] (03CR) 10RLazarus: [C:03+1] "Just to add a little excitement: I had a moment of uncertainty whether the `env` map in the container spec is applied to the lifecycle hoo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [21:59:46] (03PS31) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T2200) [22:05:16] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:06:36] (03CR) 10RLazarus: [C:03+1] envoy: Allow inboundonly drain and support min wait time (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [22:07:03] (03CR) 10CDobbins: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:14:39] (03CR) 10RLazarus: [C:03+1] "Helm diffs look good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [22:22:52] (03CR) 10Cwhite: [C:03+1] "LGTM from my side! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1237215 (https://phabricator.wikimedia.org/T255568) (owner: 10Majavah) [22:24:40] 06SRE, 10Observability-Metrics, 10Prod-Kubernetes, 06ServiceOps new, 06SRE Observability (FY2025/2026-Q3): write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11648811 (10colewhite) [22:28:04] (03PS3) 10Hashar: Revert^2 "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) [22:29:24] (03CR) 10Hashar: "I have removed the link to T379714 which is "Upgrade to Gerrit 3.11" it is unrelated. Though MAYBE the replication plugin has a fix for t" [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) (owner: 10Hashar) [22:29:51] jouncebot: nowandnext [22:29:51] For the next 0 hour(s) and 30 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T2200) [22:29:51] In 8 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T0700) [22:30:01] I need to restart Gerrit [22:31:40] (03CR) 10BCornwall: "Really close!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:35:02] !log Restarted Gerrit due to a replication config issue [22:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:20] !log import ncmonitor 3.1.0~deb13u1 into trixie-wikimedia (T401832) [22:37:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:25] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [22:37:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:40:39] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS trixie [22:43:57] (03CR) 10Cwhite: [C:03+1] mtail: Use the Debian version of mtail universally [puppet] - 10https://gerrit.wikimedia.org/r/1243048 (owner: 10Muehlenhoff) [22:45:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:47:23] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:52:13] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
ncmonitor1001.eqiad.wmnet with reason: host reimage
[22:58:23] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:58:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:58:32] !log [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs1*}' 'systemctl restart wdqs-blazegraph'`
[22:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[23:04:52] !log [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs2*} AND NOT P{wdqs2012*}' 'systemctl restart wdqs-blazegraph'` (2012 still seems healthy, rest are all not)
[23:04:52] (CR) Scott French: "Thanks, Moritz." [puppet] - https://gerrit.wikimedia.org/r/1243166 (owner: Muehlenhoff)
[23:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:05:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:08:25] (PS7) Aaron Schulz: trafficserver: cleanup redundant lint-related rest gateway routing config [puppet] - https://gerrit.wikimedia.org/r/1210631
[23:08:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:09:08] (PS2) Aaron Schulz: Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188)
[23:11:27] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:19:50] (PS1) Hashar: gerrit: update gerrit2002 after reimaging [puppet] - https://gerrit.wikimedia.org/r/1243257 (https://phabricator.wikimedia.org/T417247)
[23:24:39] (PS2) Scott French: envoy: Allow inboundonly drain and support min wait time [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245)
[23:26:22] (PS1) BCornwall: ncmonitor: Add ncmonitor sysuser [puppet] - https://gerrit.wikimedia.org/r/1243258
[23:28:27] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:28:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:28:38] (PS2) BCornwall: ncmonitor: Add ncmonitor sysuser [puppet] - https://gerrit.wikimedia.org/r/1243258
[23:29:39] (CR) Scott French: "100% agreed. On balance, I'm hopeful that we can get rid of this with the transition to sidecar containers, since this is actually somethi" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[23:29:52] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncmonitor1001.eqiad.wmnet with OS trixie
[23:31:25] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8141/co" [puppet] - https://gerrit.wikimedia.org/r/1243258 (owner: BCornwall)
[23:33:02] (CR) Bking: [C:+2] gerrit: update gerrit2002 after reimaging [puppet] - https://gerrit.wikimedia.org/r/1243257 (https://phabricator.wikimedia.org/T417247) (owner: Hashar)
[23:35:01] (CR) Scott French: [V:+2] "Built and verified against local envoy test setup (again)." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[23:35:23] (CR) RLazarus: [C:+1] envoy: Allow inboundonly drain and support min wait time [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[23:36:43] (CR) ArielGlenn: "I think (with a caveat, see the one comment) that this is ok. I am uneasy that we don't have an end to end test for this in staging, which" [deployment-charts] - https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: Daniel Kinzler)
[23:38:20] (CR) Scott French: [V:+2 C:+2] envoy: Allow inboundonly drain and support min wait time [docker-images/production-images] - https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[23:41:32] !log built envoy images (1.35.7-3) - T364245
[23:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:36] T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245
[23:42:17] (PS3) BCornwall: ncmonitor: Add ncmonitor sysuser [puppet] - https://gerrit.wikimedia.org/r/1243258
[23:43:03] (CR) CI reject: [V:-1] ncmonitor: Add ncmonitor sysuser [puppet] - https://gerrit.wikimedia.org/r/1243258 (owner: BCornwall)
[23:45:33] (PS4) BCornwall: ncmonitor: Add ncmonitor sysuser [puppet] - https://gerrit.wikimedia.org/r/1243258
[23:47:33] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8143/co" [puppet] - https://gerrit.wikimedia.org/r/1243258 (owner: BCornwall)