[00:00:41] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269720|Disable query pages on testcommonswiki not compatible with split (T421914)]] [00:00:45] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [00:02:25] !log zabe@deploy1003 zabe: Backport for [[gerrit:1269720|Disable query pages on testcommonswiki not compatible with split (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:03:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be2005.codfw.wmnet with reason: host reimage [00:04:13] !log zabe@deploy1003 zabe: Continuing with sync [00:07:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:07:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:07:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be2006.codfw.wmnet with OS bookworm [00:07:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11806988 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-be2006.codfw.wmnet with OS bookworm comp... 
[00:07:58] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269720|Disable query pages on testcommonswiki not compatible with split (T421914)]] (duration: 07m 17s) [00:08:01] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [00:21:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:24:24] jhancock@cumin2002 reimage (PID 1591632) is awaiting input [00:26:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:26:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be2005.codfw.wmnet with OS bookworm [00:26:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807025 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-be2005.codfw.wmnet with OS bookworm comp... [00:29:27] !log marked 425 content rows as bad # T393237 [00:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:30] T393237: Some en.wikipedia pageviews fatal "RevisionAccessException: Failed to load data blob from {address} for revision {revision}." 
- https://phabricator.wikimedia.org/T393237 [00:33:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:19] (03PS1) 10Zabe: Stop setting specific virtual domain for link tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) [00:37:25] (03PS2) 10Zabe: Stop setting specific virtual domain for link tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) [00:39:25] (03CR) 10Zabe: [C:03+2] Start reading from new file tables on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269086 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [00:40:19] (03Merged) 10jenkins-bot: Start reading from new file tables on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269086 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [00:40:38] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269086|Start reading from new file tables on enwiki (T416548)]] [00:40:41] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [00:41:56] (03CR) 10Ladsgroup: [C:03+1] Stop setting specific virtual domain for link tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [00:42:21] !log zabe@deploy1003 zabe: Backport for [[gerrit:1269086|Start reading from new file tables on enwiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:43:04] !log zabe@deploy1003 zabe: Continuing with sync [00:46:49] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269086|Start reading from new file tables on enwiki (T416548)]] (duration: 06m 11s) [00:46:53] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [00:47:22] (03CR) 10Zabe: [C:03+2] Stop setting specific virtual domain for link tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [00:48:06] (03PS15) 10Ryan Kemper: cloudelastic: Prepare for opensearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [00:48:13] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [00:48:14] (03Merged) 10jenkins-bot: Stop setting specific virtual domain for link tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [00:48:25] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807070 (10Jhancock.wm) [00:48:33] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269744|Stop setting specific virtual domain for link tables (T421914)]] [00:48:36] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [00:48:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807072 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @MatthewVernon all yours [00:50:20] !log zabe@deploy1003 zabe: Backport for [[gerrit:1269744|Stop setting specific virtual domain for link tables (T421914)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:50:39] !log zabe@deploy1003 zabe: Continuing with sync [00:51:37] (03CR) 10RLazarus: [C:03+2] function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [00:53:50] (03Merged) 10jenkins-bot: function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [00:54:24] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269744|Stop setting specific virtual domain for link tables (T421914)]] (duration: 05m 51s) [00:54:27] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [00:57:47] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:57:56] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [01:09:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269755 [01:09:52] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269755 (owner: 10TrainBranchBot) [01:20:46] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269755 (owner: 10TrainBranchBot) [01:22:59] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [01:23:55] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [01:25:43] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [01:26:21] (03CR) 10Ecarg: [C:03+2] wikifunctions: Stop testing the v1 orchestrator endpoint, we're dropping it [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1269072 (https://phabricator.wikimedia.org/T421768) (owner: 10Jforrester) [01:26:36] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [01:28:21] (03Merged) 10jenkins-bot: wikifunctions: Stop testing the v1 orchestrator endpoint, we're dropping it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269072 (https://phabricator.wikimedia.org/T421768) (owner: 10Jforrester) [01:30:16] (03PS1) 10Zabe: Set $wgGlobalUsageSharedRepoWiki for testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) [01:30:28] (03CR) 10Zabe: "Do we want this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [01:31:40] (03PS1) 10Zabe: Also disable updates for GloballyWantedFiles on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269759 (https://phabricator.wikimedia.org/T421914) [03:05:41] (03CR) 10Anzx: Drop 1.5x logos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: 10Pppery) [03:58:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:26:09] (03PS2) 10Ryan Kemper: growthbook: Add API key placeholders for automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) [05:03:59] (03CR) 10Marostegui: [C:03+2] installserver: Wipe clouddb1019 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1269467 (https://phabricator.wikimedia.org/T422813) (owner: 10Marostegui) [05:05:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with 
OS trixie [05:06:02] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 3 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie [05:10:23] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 3 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807373 (10Marostegui) p:05Triage→03Medium [05:40:40] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807424 (10Marostegui) @Jclark-ctr I am not able to reimage the host, it is not rebooting, can you check onsite what's on the screen? I've tried several times to reboot it... [05:45:05] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:46:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:46:47] (03PS1) 10Marostegui: clouddb1019: Adding a note [puppet] - 10https://gerrit.wikimedia.org/r/1269841 [05:46:59] (03CR) 10Marostegui: "This is a noop - a note for future usage" [puppet] - 10https://gerrit.wikimedia.org/r/1269841 (owner: 10Marostegui) [05:47:05] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:47:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:47:24] (03CR) 10Marostegui: [C:03+2] clouddb1019: Adding a note 
[puppet] - 10https://gerrit.wikimedia.org/r/1269841 (owner: 10Marostegui) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T0600) [06:02:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:26:51] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1019.eqiad.wmnet with OS trixie [06:27:04] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807473 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: -... [06:37:15] 06SRE, 10SRE-Access-Requests: Requesting access to stats hosts for Daniel Kinzler - https://phabricator.wikimedia.org/T422827#11807481 (10daniel) The SSH key is the one that I also use for access to the deployment hosts, is that ok? [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T0700) [07:09:06] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on gitlab1003.wikimedia.org with reason: Upgrade [07:28:05] (03CR) 10Jelto: [C:03+2] "that's a relatively new service from Nat and probably also just read-only but I have to double check with Nat." 
[dns] - 10https://gerrit.wikimedia.org/r/1269452 (https://phabricator.wikimedia.org/T422819) (owner: 10Jelto) [07:29:21] !log jelto@dns1004 START - running authdns-update [07:30:47] !log jelto@dns1004 END - running authdns-update [07:50:16] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11807564 (10MatthewVernon) I've eyeballed the discussion here - AFAICT apus is behaving as expected? I have... [07:57:36] (03CR) 10Brouberol: [C:03+1] growthbook: Add API key placeholders for automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) (owner: 10Ryan Kemper) [07:58:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:15] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:09:14] (03PS1) 10Jelto: gitlab::rsync: add misssing ensure [puppet] - 10https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) [08:12:47] (03PS8) 10Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [08:12:47] (03PS7) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [08:13:09] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8405/console" [puppet] - 10https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) (owner: 10Jelto) [08:13:14] (03CR) 10Elukey: Move linting to Ruff and apply code fixes (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:14:05] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:14:10] (03CR) 10Brouberol: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1266959 (owner: 10Muehlenhoff) [08:14:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:16:07] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab::rsync: add misssing ensure [puppet] - 10https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) (owner: 10Jelto) [08:16:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:17:11] (03PS1) 10Elukey: aux-k8s-services: update Jaeger's Istio DR after k8s upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269868 (https://phabricator.wikimedia.org/T414486) [08:19:25] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[08:19:48] (03CR) 10CI reject: [V:04-1] tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:21:14] (03CR) 10CI reject: [V:04-1] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:22:05] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:24:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:25:05] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:27:37] (03PS1) 10Tiziano Fogli: thanos/compact: reduce concurrency due to disk constraints [puppet] - 10https://gerrit.wikimedia.org/r/1269955 (https://phabricator.wikimedia.org/T386911) [08:29:02] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: reduce concurrency due to disk constraints [puppet] - 10https://gerrit.wikimedia.org/r/1269955 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [08:32:05] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:32:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:33:21] (03CR) 10Clément Goubert: "One last small change and then lgtm." 
[puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [08:33:48] (03PS1) 10MVernon: apus: add two new storage nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) [08:34:12] (03PS2) 10MVernon: apus: add two new storage nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) [08:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:49] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [08:38:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807714 (10MatthewVernon) Thanks @Jhancock.wm :) [08:41:11] (03CR) 10Federico Ceratto: [C:03+1] "The hostnames match the description and the related task. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [08:42:44] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:44:32] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:44:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:45:00] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:45:13] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:45:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[08:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventgate-main.svc.codfw.wmnet:4492 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventgate-main.svc.codfw.wmnet:4492 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:54:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:54:18] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:54:31] (03CR) 10Hashar: "That one failed due to T422907 , I have reverted Vector patch https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1269962" [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269755 (owner: 10TrainBranchBot) [08:54:33] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:54:40] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:55:00] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [09:00:24] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:01:29] (03PS9) 10Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [09:01:29] (03PS8) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [09:01:29] (03PS1) 10Elukey: tests: fix icinga tests when running on py3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 [09:02:21] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:02:58] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:04:41] (03PS1) 10Federico Ceratto: admin: Add second U2F key, remove non-U2F SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1269970 [09:04:41] (03CR) 10Federico Ceratto: "As discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1269970 (owner: 10Federico Ceratto) [09:04:46] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:06:17] (03CR) 10JMeybohm: [C:03+1] aux-k8s-services: update Jaeger's Istio DR after k8s upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269868 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [09:07:52] (03CR) 10CI reject: [V:04-1] tests: fix icinga tests when running on py3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 (owner: 10Elukey) [09:13:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:14:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:15:22] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:16:25] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [09:16:59] (03CR) 10Clément Goubert: [C:03+2] Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 (owner: 10Clément Goubert) [09:17:12] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:17:14] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2147 (T419635)', diff saved to https://phabricator.wikimedia.org/P90346 and previous config saved to /var/cache/conftool/dbconfig/20260410-091713-fceratto.json [09:17:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:18:06] (03PS2) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) [09:18:34] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[09:19:01] (03Merged) 10jenkins-bot: Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 (owner: 10Clément Goubert) [09:19:04] (03CR) 10Elukey: [C:03+2] aux-k8s-services: update Jaeger's Istio DR after k8s upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269868 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [09:21:13] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:21:22] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:21:32] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:21:48] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:22:06] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:22:21] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:24:19] (03PS1) 10Andrew McAllister (WMDE): Allow WMDE Airflow instance to egress to dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269978 (https://phabricator.wikimedia.org/T414583) [09:24:21] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: sync [09:24:27] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: sync [09:24:28] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:24:37] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: sync [09:24:46] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: sync [09:24:47] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:26:04] (03PS2) 10Elukey: tests: fix icinga tests when running on py3.13 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 [09:26:04] (03PS10) 10Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [09:26:04] (03PS9) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [09:29:38] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:29:48] (03CR) 10CI reject: [V:04-1] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:30:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:30:38] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 (owner: 10Elukey) [09:31:09] (03CR) 10Volans: Move linting to Ruff and apply code fixes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:31:32] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:34:14] (03CR) 10Elukey: [C:03+2] tests: fix icinga tests when running on py3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 (owner: 10Elukey) [09:34:46] (03CR) 10Elukey: [C:03+2] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:35:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:36:03] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:39:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [09:41:27] (03PS3) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) [09:42:53] (03Merged) 10jenkins-bot: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:43:02] (03CR) 10Atsuko: "I addressed the comments and also elaborated on the decision for not adding the script for default deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [09:45:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:48:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:48:20] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:50:06] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[09:53:10] 06SRE: Add atsuko to icinga - https://phabricator.wikimedia.org/T422921 (10atsuko) 03NEW [09:54:33] (03PS1) 10Volans: reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 [09:55:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:21] (03PS1) 10Atsuko: icinga: add Atsuko Ito to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1269987 (https://phabricator.wikimedia.org/T422921) [09:57:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:58:00] (03PS1) 10Fabfur: cache::aptrepo: restore haproxy28 component and update [puppet] - 10https://gerrit.wikimedia.org/r/1269988 [09:58:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:00:29] (03CR) 10Elukey: [C:03+1] reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 (owner: 10Volans) [10:01:20] (03CR) 10Volans: [C:03+2] reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 (owner: 10Volans) [10:02:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:28] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T422668 [10:04:43] 06SRE, 10Wikimedia-Mailing-lists, 07Sustainability (Incident Followup): lists apache 
config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#11808033 (10MLechvien-WMF) a:03Ladsgroup @Ladsgroup This is not an Active Investigation and qualifies more for an Follow up Action... [10:05:51] (03PS3) 10Majavah: hieradata: Enable paging for dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268979 [10:07:51] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:08:03] (03Merged) 10jenkins-bot: reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 (owner: 10Volans) [10:09:40] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:09:47] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:10:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:10:23] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:10:36] (03PS2) 10Vgutierrez: cache::aptrepo: restore haproxy28 component and update [puppet] - 10https://gerrit.wikimedia.org/r/1269988 (https://phabricator.wikimedia.org/T422926) (owner: 10Fabfur) [10:10:55] (03PS1) 10Elukey: haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 [10:11:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:11:16] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:11:36] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:12:22] (03CR) 10Vgutierrez: [C:03+2] cache::aptrepo: restore haproxy28 component and update [puppet] - 10https://gerrit.wikimedia.org/r/1269988 (https://phabricator.wikimedia.org/T422926) (owner: 10Fabfur) [10:13:45] (03CR) 10Brouberol: [C:03+1] "LG!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269978 (https://phabricator.wikimedia.org/T414583) (owner: 10Andrew McAllister (WMDE)) [10:14:29] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:16:18] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:16:24] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:18:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:18:16] (03CR) 10Elukey: "$ docker run -ti docker-registry.wikimedia.org/haproxy:3.2.15-1 -vv" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (owner: 10Elukey) [10:19:06] !log upload haproxy 2.8.20 to thirdparty/haproxy28 for bookworm-wikimedia (apt.wm.o) - T422926 [10:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:10] T422926: Thumbor is using an unmantained HAProxy version - https://phabricator.wikimedia.org/T422926 [10:19:24] (03CR) 10Brouberol: [C:03+1] icinga: add Atsuko Ito to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1269987 (https://phabricator.wikimedia.org/T422921) (owner: 10Atsuko) [10:20:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:20:22] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:21:13] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[10:25:44] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:27:32] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:27:39] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:30:00] (03PS2) 10Clément Goubert: haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [10:30:04] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:31:31] (03CR) 10Brouberol: airflow: dag filter helper function (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [10:35:32] (03PS1) 10Elukey: istio: revisit Prometheus buckets for ML's gateway/sidecar sources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) [10:37:22] (03CR) 10Elukey: "Kicked off the conversation with some high level values, lemme know if you want to change them further. My goal is to have a standard that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [10:43:50] 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11808224 (10FCeratto-WMF) That's pretty much the issue to discuss: we have only very few warnings on IRC (not pages) as datapoints and no immediate way to simulate... 
[10:53:48] (03CR) 10Vgutierrez: [C:03+1] "from a traffic PoV this makes more sense than maintaining a 2.8 version with bookworm given that we get rid of OpenSSL 3.0 in favor of 3.5" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [10:58:12] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:00:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T1100). [11:00:38] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:01:32] aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade. [11:02:29] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:02:44] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:03:32] FIRING: SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:04:41] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:04:59] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[11:06:36] (03CR) 10Atsuko: "all addressed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [11:06:55] (03PS4) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) [11:07:01] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:08:03] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116992 bytes in 2.368 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:08:32] RESOLVED: SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:10:47] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T422668 [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:54] (03PS5) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) [11:13:37] (03CR) 10Brouberol: [C:03+1] "Nicely done!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [11:14:59] (03CR) 10Atsuko: [C:03+2] icinga: add Atsuko Ito to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1269987 (https://phabricator.wikimedia.org/T422921) (owner: 10Atsuko) [11:15:02] FIRING: [3x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:15:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:52] (03CR) 10Atsuko: [C:03+2] airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [11:16:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:16:55] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T419635)', diff saved to https://phabricator.wikimedia.org/P90347 and previous config saved to /var/cache/conftool/dbconfig/20260410-111654-fceratto.json [11:16:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:19:08] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[11:19:40] (03Merged) 10jenkins-bot: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [11:20:02] FIRING: [5x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:20:59] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:22:54] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:25:02] FIRING: [7x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:27:43] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P90348 and previous config saved to /var/cache/conftool/dbconfig/20260410-112742-fceratto.json [11:30:02] RESOLVED: [7x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:30:57] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11808307 (10NakavoDev) >>! In T422872#11806792, @Reedy wrote: > By you extract one of the links... What do you mean? > > Are you always getting thumbs? Or are you sometimes (often?) requesting the originals based on siz... 
[11:33:40] 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11808312 (10Marostegui) I'm on call next week so we can force a lag page [11:37:42] (03PS1) 10Aude: Opt-in new accounts to ReadingLists beta feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) [11:38:32] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P90349 and previous config saved to /var/cache/conftool/dbconfig/20260410-113830-fceratto.json [11:40:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [11:49:21] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T419635)', diff saved to https://phabricator.wikimedia.org/P90350 and previous config saved to /var/cache/conftool/dbconfig/20260410-114919-fceratto.json [11:49:25] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:49:28] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:50:12] 06SRE, 10Observability-Alerting, 07Sustainability (Incident Followup): Paging alert on combination of hosts down and a BGP outage - https://phabricator.wikimedia.org/T417051#11808348 (10MLechvien-WMF) tagging #sre_observability as I'm not sure the Alerting tag is used [11:50:17] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2155 (T419635)', diff saved to https://phabricator.wikimedia.org/P90351 and previous config saved to 
/var/cache/conftool/dbconfig/20260410-115015-fceratto.json [11:58:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:09:18] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:15:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:22:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:24:08] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:24:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:25:45] 06SRE: Add atsuko to icinga - https://phabricator.wikimedia.org/T422921#11808463 (10atsuko) 05Open→03In progress [12:26:10] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:27:38] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:32:24] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:33:38] 06SRE: Add atsuko to icinga - https://phabricator.wikimedia.org/T422921#11808476 (10atsuko) 05In progress→03Resolved puppet changes: merged puppet-private changes: `8af215ff3c0c08599a9e52ff5855d197a835c418` [12:34:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:37:06] (03PS1) 10Brouberol: growthbook-next: deploy an image containing an unreleased feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270028 (https://phabricator.wikimedia.org/T420781) [12:39:41] (03CR) 10Santiago Faci: [C:03+2] growthbook-next: deploy an image containing an unreleased feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270028 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol) [12:41:35] (03Merged) 10jenkins-bot: growthbook-next: deploy an image containing an unreleased feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270028 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol) [12:44:42] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:46:32] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:47:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [12:51:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:52:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [12:54:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:54:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:55:37] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:57:13] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:57:26] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:57:33] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[12:57:46] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [12:57:46] (03PS1) 10Ladsgroup: envoy: Close connections to swift after 10s of inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872) [12:59:23] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:01:40] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove dns for decom lumen transport cct - cmooney@cumin1003" [13:01:45] (03PS1) 10Cathal Mooney: Remove include for 2620:0:861:fe06::/64 link range [dns] - 10https://gerrit.wikimedia.org/r/1270032 (https://phabricator.wikimedia.org/T395878) [13:02:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove dns for decom lumen transport cct - cmooney@cumin1003" [13:02:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:02:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:43] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox [13:06:36] (03CR) 10Cathal Mooney: [C:03+2] Remove include for 2620:0:861:fe06::/64 link range [dns] - 10https://gerrit.wikimedia.org/r/1270032 (https://phabricator.wikimedia.org/T395878) (owner: 10Cathal Mooney) [13:06:53] !log cmooney@dns2005 START - running authdns-update [13:07:11] !log cmooney@dns2005 START - running authdns-update [13:08:15] !log cmooney@dns2005 END - running authdns-update 
[13:12:04] (03CR) 10Clément Goubert: [C:03+1] envoy: Close connections to swift after 10s of inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [13:15:10] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:16:59] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:19:50] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:19:53] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:20:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:21:12] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:21:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T410589)', diff saved to https://phabricator.wikimedia.org/P90357 and previous config saved to /var/cache/conftool/dbconfig/20260410-132119-ladsgroup.json [13:21:23] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:21:41] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:22:08] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:22:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T410589)', diff saved to https://phabricator.wikimedia.org/P90358 and previous config saved to /var/cache/conftool/dbconfig/20260410-132215-ladsgroup.json [13:28:21] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:29:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:30:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:30:12] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:30:24] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:30:28] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:32:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:32:21] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:32:36] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:36:33] (03CR) 10Filippo Giunchedi: [C:03+1] "I don't feel strongly either way tbh, I'd say let's go for it for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268979 (owner: 10Majavah) [13:36:38] (03PS10) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [13:37:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11808649 (10ssingh) [13:40:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:44:35] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:44:40] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:49:36] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T419635)', diff saved to https://phabricator.wikimedia.org/P90362 and previous config saved to /var/cache/conftool/dbconfig/20260410-134935-fceratto.json [13:49:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:00:25] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P90363 and previous config saved to /var/cache/conftool/dbconfig/20260410-140023-fceratto.json [14:01:41] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270038 [14:08:12] (03CR) 10Bking: [C:03+2] cloudelastic: Prepare for opensearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [14:10:21] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:10:22] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:10:56] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:11:13] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P90365 and previous config saved to /var/cache/conftool/dbconfig/20260410-141112-fceratto.json [14:11:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:11:50] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:12:18] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:13:08] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:13:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:13:46] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:13:58] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:17:51] (03CR) 10JMeybohm: "Feel free to also loop in the friendly folks from netops ( @cmooney@wikimedia.org || @ayounsi@wikimedia.org) to double check if these quer" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [14:19:09] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270038 (owner: 10Gkyziridis) [14:19:40] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11808793 (10Reedy) Do you know what stats of thumbnail vs original you’re requesting? Generally, thumbnails are definitely preferred, so if you’re preferring original because it’s first match, that will start to explain... [14:21:23] (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual model on experimental staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1270038 (owner: 10Gkyziridis) [14:22:02] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T419635)', diff saved to https://phabricator.wikimedia.org/P90366 and previous config saved to /var/cache/conftool/dbconfig/20260410-142200-fceratto.json [14:22:05] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:22:21] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [14:23:09] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90367 and previous config saved to /var/cache/conftool/dbconfig/20260410-142308-fceratto.json [14:25:09] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:27:00] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:30:40] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:30:48] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:32:29] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:35:27] (03PS1) 10Aude: Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) [14:35:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [14:36:24] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:36:30] (03PS11) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [14:38:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:38:41] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:38:50] (03CR) 10CI reject: [V:04-1] kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [14:40:17] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:40:52] (03PS12) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [14:41:27] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO) [14:44:02] (03CR) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. 
(034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [14:45:25] 06SRE-OnFire, 10corto: Cortobot help command should not spam the main channel - https://phabricator.wikimedia.org/T421858#11808871 (10Peachey88) [14:45:26] 06SRE-OnFire, 10corto: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#11808872 (10Peachey88) [14:45:28] 06SRE-OnFire, 10corto, 10Incident Tooling: corto: track responders - https://phabricator.wikimedia.org/T391897#11808873 (10Peachey88) [14:45:31] 06SRE-OnFire, 10corto, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#11808874 (10Peachey88) [14:45:32] 06SRE-OnFire, 10corto, 10Incident Tooling: Corto: Functional & Integration testing - https://phabricator.wikimedia.org/T377036#11808875 (10Peachey88) [14:45:58] (03PS13) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) [14:48:32] 06SRE, 10SRE-Access-Requests: Requesting access to stats hosts for Daniel Kinzler - https://phabricator.wikimedia.org/T422827#11808892 (10daniel) 05Open→03Invalid Turns out I'm already in the relevant LDAP group. I filed a separate ticket for the Kerberos credentials: T422947 [14:53:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:54:15] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:57:42] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11808913 (10taavi) 05Open→03Resolved a:03ABran-WMF The patch is merged... 
[15:00:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:05:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:11] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [15:12:11] (03CR) 10Anzx: [C:04-1] "create new patch for deleting logo files and schedule it a week later, because some files may be cached it would be safe delete after a fe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [15:14:12] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11808965 (10Scott_French) >>! In T422166#11807564, @MatthewVernon wrote: > I've eyeballed the discussion he... 
[15:16:19] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie [15:17:14] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephmon1006.eqiad.wmnet [15:20:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:23:57] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon1006.eqiad.wmnet [15:26:57] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11808983 (10NakavoDev) >>! In T422872#11808793, @Reedy wrote: > Do you know what stats of thumbnail vs original you’re requesting? > > Generally, thumbnails are definitely preferred, so if you’re preferring original bec... 
[15:27:26] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [15:29:00] (03CR) 10Dzahn: [C:03+1] gitlab::rsync: add misssing ensure [puppet] - 10https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) (owner: 10Jelto) [15:34:16] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [15:34:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:35:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:38:37] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:38:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:39:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephmon1006.eqiad.wmnet [15:41:12] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1012.eqiad.wmnet with OS trixie [15:44:57] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:45:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:45:57] (03PS1) 
10Scott French: wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) [15:46:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon1006.eqiad.wmnet [15:49:07] (03CR) 10Cathal Mooney: [C:03+1] "Overall LGTM, few comments in-line." [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [15:49:16] (03CR) 10JMeybohm: [C:03+1] wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [15:55:22] (03CR) 10Stoyofuku-wmf: [C:03+1] "I never learned what our timestamp format is - is that 1pm UTC? Either way, this looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [15:58:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:01] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:59:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [16:01:13] (03CR) 10Herron: [C:03+1] smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: 10Cwhite) [16:05:34] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox [16:07:24] (03CR) 10Aude: "yes it is 1pm UTC which is start of the backport 
window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [16:08:28] (03CR) 10Herron: [C:03+1] prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [16:09:10] (03CR) 10Herron: [C:03+1] "nice!" [alerts] - 10https://gerrit.wikimedia.org/r/1269673 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [16:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:27:15] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon2004-dev - https://phabricator.wikimedia.org/T422437#11809098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:27:27] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90368 and previous config saved to /var/cache/conftool/dbconfig/20260410-162726-fceratto.json [16:27:33] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:29:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:29:58] (03CR) 10Scott French: [C:03+2] wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [16:34:15] RESOLVED: [2x] 
JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:37:34] (03Merged) 10jenkins-bot: wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [16:38:15] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P90370 and previous config saved to /var/cache/conftool/dbconfig/20260410-163814-fceratto.json [16:39:33] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398 [16:39:36] T421398: SystemdUnitFailed - zuul-executor - https://phabricator.wikimedia.org/T421398 [16:49:03] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P90371 and previous config saved to /var/cache/conftool/dbconfig/20260410-164902-fceratto.json [16:51:29] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: 10Cwhite) [16:52:20] (03PS1) 10Bking: cirrussearch: move cloudelastic1012 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1270061 (https://phabricator.wikimedia.org/T422860) [16:54:41] (03CR) 10Bking: [C:03+2] cirrussearch: move cloudelastic1012 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1270061 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [16:57:00] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie [16:59:52] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90372 and previous config saved to /var/cache/conftool/dbconfig/20260410-165951-fceratto.json [16:59:56] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:00:11] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance [17:03:19] (03CR) 10Bernard Wang: [C:03+1] Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [17:08:44] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [17:11:38] 06SRE, 10Wikimedia-Mailing-lists, 07Sustainability (Incident Followup): lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#11809235 (10Ladsgroup) I honestly think this is a discussion for sre-collab team since they own mailman now. 
[17:12:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [17:26:11] !log dancy@deploy1003 Installing scap version "4.248.0" for 2 host(s) [17:27:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS trixie [17:28:02] !log dancy@deploy1003 Installation of scap version "4.248.0" completed for 2 hosts [17:41:10] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:14] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:51:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:54:44] 06SRE, 10SRE-Access-Requests, 07Essential-Work: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11809352 (10JMoore-WMF) hi- i'm unable to access https://superset.wikimedia.org/superset/dashboard/409/?native_filters_key... 
[17:55:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:57:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:00:52] 06SRE, 10SRE-Access-Requests, 07Essential-Work: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11809441 (10mpopov) I suspect Justin is seeing the same error as me: ` Error: {'message': 'Permission denied: user=bearlo... [18:01:52] (03CR) 10Dwisehaupt: [V:03+1] "@adenisse@wikimedia.org @kherron@wikimedia.org Adding you two in as reviewers for this. 
As stated in T422888, I'm not 100% certain that sr" [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [18:06:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:06:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:09:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:09:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:11:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:11:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:18:48] (03PS1) 10Zabe: NewFilesPager: Make sure filerevision is queried before file [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270068 (https://phabricator.wikimedia.org/T422946) [18:23:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:26:59] !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@46eae53] (releasing): (no justification provided) [18:27:53] !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@46eae53] (releasing): (no justification provided) (duration: 00m 
56s) [18:34:08] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance [18:34:56] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2206 (T419635)', diff saved to https://phabricator.wikimedia.org/P90373 and previous config saved to /var/cache/conftool/dbconfig/20260410-183455-fceratto.json [18:35:00] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:39:13] (03PS1) 10Bking: opensearch: move cloudelastic1012 back into prod role [puppet] - 10https://gerrit.wikimedia.org/r/1270071 (https://phabricator.wikimedia.org/T422860) [18:40:26] (03CR) 10Bking: [C:03+2] opensearch: move cloudelastic1012 back into prod role [puppet] - 10https://gerrit.wikimedia.org/r/1270071 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [19:25:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:26:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:26:46] (03CR) 10Jdlrobson: [C:03+1] Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [19:28:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:31:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers 
wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:32:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:34:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:35:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:37:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:40:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:43:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11809628 (10VRiley-WMF) Finally was able to get Dell to send out a new part for 
the unit. Part should arrive next business day. [19:47:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:47:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:47:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:48:25] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:48:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:50:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:51:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:52:23] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:55:32] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:56:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, 
wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:56:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:57:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:57:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:57:55] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:58:38] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:02:54] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:05:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11809647 (10VRiley-WMF) I seem to be running into the same issue that @Jhancock.wm is running into with T418899. Awaiting to see what the fix would be. 
[20:21:00] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T419635)', diff saved to https://phabricator.wikimedia.org/P90376 and previous config saved to /var/cache/conftool/dbconfig/20260410-202059-fceratto.json [20:21:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [20:30:43] (03PS1) 10Bking: cloudelastic: remove logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/1270082 (https://phabricator.wikimedia.org/T324335) [20:31:49] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P90377 and previous config saved to /var/cache/conftool/dbconfig/20260410-203147-fceratto.json [20:31:56] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270082 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:40:12] (03CR) 10Bking: [C:03+2] cloudelastic: remove logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/1270082 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [20:42:38] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P90378 and previous config saved to /var/cache/conftool/dbconfig/20260410-204236-fceratto.json [20:48:25] FIRING: [8x] SystemdUnitFailed: opensearch-disable-readahead-cloudelastic-chi-eqiad.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:25] FIRING: [9x] SystemdUnitFailed: nginx.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:25] !log fceratto@cumin2002 dbctl 
commit (dc=all): 'Repooling after maintenance db2206 (T419635)', diff saved to https://phabricator.wikimedia.org/P90380 and previous config saved to /var/cache/conftool/dbconfig/20260410-205324-fceratto.json [20:53:29] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [20:53:33] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance [20:54:21] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2210 (T419635)', diff saved to https://phabricator.wikimedia.org/P90381 and previous config saved to /var/cache/conftool/dbconfig/20260410-205420-fceratto.json [20:56:45] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [20:57:47] ^^ known [20:57:59] Will set a suppression for this guy [20:59:26] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cloudelastic1012.eqiad.wmnet with reason: still fixing Puppet [21:17:55] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [21:17:55] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [21:17:55] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [21:17:56] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Search
[21:17:57] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:19:25] ^^ known, should clear shortly
[21:20:55] RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:20:55] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:20:55] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:20:56] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:24:44] (03PS1) 10Bking: nginx tls proxy: remove defunct directive [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860)
[21:25:08] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[21:33:37] (03CR) 10Ryan Kemper: [C:03+1] nginx tls proxy: remove defunct directive [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[21:37:59] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie
[21:38:16] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11809821 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[21:52:19] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11809846 (10Jclark-ctr) I have not had any luck with getting it to power on. I will start Monday with pulling parts from decom servers to try to get it back up.
[21:52:53] RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 27.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:07] 06SRE, 06Infrastructure-Foundations: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11809847 (10Jdforrester-WMF) 05Open→03Resolved a:03Jdforrester-WMF Yup, we're using it in production!
[21:53:11] RECOVERY - MariaDB Replica Lag: s7 on db1155 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:25] RECOVERY - MariaDB Replica Lag: s7 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:51] RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T410589)', diff saved to https://phabricator.wikimedia.org/P90382 and previous config saved to /var/cache/conftool/dbconfig/20260410-215358-ladsgroup.json
[21:54:02] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[21:54:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:57:11] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:58:23] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:00:24] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:02:35] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:04:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P90383 and previous config saved to /var/cache/conftool/dbconfig/20260410-220406-ladsgroup.json
[22:07:25] (03CR) 10Cwhite: [C:03+2] add beta-logs pki key [labs/private] - 10https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite)
[22:07:34] (03CR) 10Cwhite: [V:03+2 C:03+2] add beta-logs pki key [labs/private] - 10https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite)
[22:08:53] (03CR) 10Cwhite: [C:03+2] initial pki config for beta-logs env [puppet] - 10https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite)
[22:10:28] jclark@cumin1003 provision (PID 1655695) is awaiting input
[22:13:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:14:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P90384 and previous config saved to /var/cache/conftool/dbconfig/20260410-221414-ladsgroup.json
[22:17:49] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host phab1006.eqiad.wmnet with OS trixie
[22:24:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T410589)', diff saved to https://phabricator.wikimedia.org/P90385 and previous config saved to /var/cache/conftool/dbconfig/20260410-222421-ladsgroup.json
[22:24:25] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[22:24:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[22:24:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T410589)', diff saved to https://phabricator.wikimedia.org/P90386 and previous config saved to /var/cache/conftool/dbconfig/20260410-222445-ladsgroup.json
[22:26:27] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11809888 (10Jclark-ctr) @Dzahn next week can you update puppet this is a uefi only server
[22:28:15] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab1006.eqiad.wmnet with OS trixie
[22:30:08] (03PS1) 10Cwhite: logging: add dummy pki "secrets" [labs/private] - 10https://gerrit.wikimedia.org/r/1270089 (https://phabricator.wikimedia.org/T350516)
[22:30:43] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1005.eqiad.wmnet with OS bookworm
[22:30:56] (03CR) 10Cwhite: [V:03+2 C:03+2] logging: add dummy pki "secrets" [labs/private] - 10https://gerrit.wikimedia.org/r/1270089 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite)
[22:31:03] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1005.eqiad.wmnet with OS bookworm
[22:31:59] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1005.eqiad.wmnet with OS bookworm
[22:32:17] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1005.eqiad.wmnet with OS bookworm
[22:33:20] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1005.eqiad.wmnet with OS bookworm
[22:34:00] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11809908 (10MisterSynergy) >>! In T421642#11785461, @Xqt wrote: > The problems began on March 25th: > {F74901675} Exact timestamp seems to be shortly after 2026-03-25 1...
[22:40:09] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T419635)', diff saved to https://phabricator.wikimedia.org/P90387 and previous config saved to /var/cache/conftool/dbconfig/20260410-224008-fceratto.json
[22:40:12] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[22:46:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:50:57] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P90388 and previous config saved to /var/cache/conftool/dbconfig/20260410-225055-fceratto.json
[22:52:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:55:22] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-be1005.eqiad.wmnet with reason: host reimage
[22:59:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be1005.eqiad.wmnet with reason: host reimage
[23:01:44] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P90389 and previous config saved to /var/cache/conftool/dbconfig/20260410-230143-fceratto.json
[23:02:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:05:04] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1006.eqiad.wmnet with OS bookworm
[23:12:32] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T419635)', diff saved to https://phabricator.wikimedia.org/P90390 and previous config saved to /var/cache/conftool/dbconfig/20260410-231231-fceratto.json
[23:12:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[23:12:51] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance
[23:13:38] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2219 (T419635)', diff saved to https://phabricator.wikimedia.org/P90391 and previous config saved to /var/cache/conftool/dbconfig/20260410-231337-fceratto.json
[23:16:43] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:17:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:17:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be1005.eqiad.wmnet with OS bookworm
[23:18:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:20:02] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-be1006.eqiad.wmnet with reason: host reimage
[23:24:08] PROBLEM - Host apus-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[23:25:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be1006.eqiad.wmnet with reason: host reimage
[23:26:34] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810040 (10Dzahn) @Jclark-ctr Ok, but I'm not sure what kind of update it needs. I never had a UEFI-only server I think.
[23:27:39] RECOVERY - Host apus-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[23:29:42] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810051 (10Jclark-ctr) currently preseed.yaml is ` 'pki*|phab*': - partman/standard.cfg - partman/raid1-2dev.cfg ` for efi would need ` - partman/standard.cf...
[23:33:05] (03PS1) 10Dzahn: zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895)
[23:33:57] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810070 (10Jclark-ctr) @Dzahn if you do make any edits please keep in mind of codfw has same server racked needing same update T418899 phab2003
[23:39:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1270106
[23:39:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1270106 (owner: 10TrainBranchBot)
[23:40:24] (03PS1) 10Dzahn: installserver: set UEFI-only recipes for newer phab* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1270107 (https://phabricator.wikimedia.org/T418905)
[23:40:48] (03PS2) 10Dzahn: installserver: set UEFI-only recipes for newer phab* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1270107 (https://phabricator.wikimedia.org/T418905)
[23:41:33] (03CR) 10Dzahn: [C:03+2] installserver: set UEFI-only recipes for newer phab* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1270107 (https://phabricator.wikimedia.org/T418905) (owner: 10Dzahn)
[23:47:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:48:27] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810160 (10Dzahn) @Jclark-ctr Gotcha! I did make the change. It should be for phab1006, phab2003 and future phab* servers with higher numbers. Done.
[23:48:36] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:49:33] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:49:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:49:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be1006.eqiad.wmnet with OS bookworm
[23:50:31] (03CR) 10Dzahn: [C:03+2] zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895) (owner: 10Dzahn)
[23:50:36] (03PS2) 10Dzahn: zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895)
[23:51:17] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1270106 (owner: 10TrainBranchBot)
[23:51:28] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:53:46] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:54:54] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:54:58] (03CR) 10Dzahn: [C:03+2] zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895) (owner: 10Dzahn)