[00:00:41] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269720|Disable query pages on testcommonswiki not compatible with split (T421914)]]
[00:00:45] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[00:02:25] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1269720|Disable query pages on testcommonswiki not compatible with split (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:03:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be2005.codfw.wmnet with reason: host reimage
[00:04:13] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[00:07:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:07:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:07:45] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be2006.codfw.wmnet with OS bookworm
[00:07:53] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11806988 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-be2006.codfw.wmnet with OS bookworm comp...
[00:07:58] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269720|Disable query pages on testcommonswiki not compatible with split (T421914)]] (duration: 07m 17s)
[00:08:01] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[00:21:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:24:24] <logmsgbot>	 jhancock@cumin2002 reimage (PID 1591632) is awaiting input
[00:26:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:26:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be2005.codfw.wmnet with OS bookworm
[00:26:18] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807025 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-be2005.codfw.wmnet with OS bookworm comp...
[00:29:27] <zabe>	 !log marked 425 content rows as bad # T393237
[00:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:30] <stashbot>	 T393237: Some en.wikipedia pageviews fatal "RevisionAccessException: Failed to load data blob from {address} for revision {revision}." - https://phabricator.wikimedia.org/T393237
[00:33:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:19] <wikibugs>	 (PS1) Zabe: Stop setting specific virtual domain for link tables [mediawiki-config] - https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914)
[00:37:25] <wikibugs>	 (PS2) Zabe: Stop setting specific virtual domain for link tables [mediawiki-config] - https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914)
[00:39:25] <wikibugs>	 (CR) Zabe: [C:+2] Start reading from new file tables on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269086 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[00:40:19] <wikibugs>	 (Merged) jenkins-bot: Start reading from new file tables on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269086 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[00:40:38] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269086|Start reading from new file tables on enwiki (T416548)]]
[00:40:41] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[00:41:56] <wikibugs>	 (CR) Ladsgroup: [C:+1] Stop setting specific virtual domain for link tables [mediawiki-config] - https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[00:42:21] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1269086|Start reading from new file tables on enwiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:43:04] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[00:46:49] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269086|Start reading from new file tables on enwiki (T416548)]] (duration: 06m 11s)
[00:46:53] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[00:47:22] <wikibugs>	 (CR) Zabe: [C:+2] Stop setting specific virtual domain for link tables [mediawiki-config] - https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[00:48:06] <wikibugs>	 (PS15) Ryan Kemper: cloudelastic: Prepare for opensearch 2 [puppet] - https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[00:48:13] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[00:48:14] <wikibugs>	 (Merged) jenkins-bot: Stop setting specific virtual domain for link tables [mediawiki-config] - https://gerrit.wikimedia.org/r/1269744 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[00:48:25] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807070 (Jhancock.wm)
[00:48:33] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269744|Stop setting specific virtual domain for link tables (T421914)]]
[00:48:36] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[00:48:43] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807072 (Jhancock.wm) Open→Resolved a:Jhancock.wm @MatthewVernon all yours
[00:50:20] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1269744|Stop setting specific virtual domain for link tables (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:50:39] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[00:51:37] <wikibugs>	 (CR) RLazarus: [C:+2] function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[00:53:50] <wikibugs>	 (Merged) jenkins-bot: function-{evaluator,orchestrator}: set AppArmor profile in container SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[00:54:24] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269744|Stop setting specific virtual domain for link tables (T421914)]] (duration: 05m 51s)
[00:54:27] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[00:57:47] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[00:57:56] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[01:09:52] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1269755
[01:09:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1269755 (owner: TrainBranchBot)
[01:20:46] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1269755 (owner: TrainBranchBot)
[01:22:59] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[01:23:55] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[01:25:43] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[01:26:21] <wikibugs>	 (CR) Ecarg: [C:+2] wikifunctions: Stop testing the v1 orchestrator endpoint, we're dropping it [deployment-charts] - https://gerrit.wikimedia.org/r/1269072 (https://phabricator.wikimedia.org/T421768) (owner: Jforrester)
[01:26:36] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[01:28:21] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Stop testing the v1 orchestrator endpoint, we're dropping it [deployment-charts] - https://gerrit.wikimedia.org/r/1269072 (https://phabricator.wikimedia.org/T421768) (owner: Jforrester)
[01:30:16] <wikibugs>	 (PS1) Zabe: Set $wgGlobalUsageSharedRepoWiki for testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914)
[01:30:28] <wikibugs>	 (CR) Zabe: "Do we want this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[01:31:40] <wikibugs>	 (PS1) Zabe: Also disable updates for GloballyWantedFiles on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1269759 (https://phabricator.wikimedia.org/T421914)
[03:05:41] <wikibugs>	 (CR) Anzx: Drop 1.5x logos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[03:58:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:26:09] <wikibugs>	 (PS2) Ryan Kemper: growthbook: Add API key placeholders for automation [deployment-charts] - https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696)
[05:03:59] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Wipe clouddb1019 entirely [puppet] - https://gerrit.wikimedia.org/r/1269467 (https://phabricator.wikimedia.org/T422813) (owner: Marostegui)
[05:05:45] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie
[05:06:02] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 3 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807371 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[05:10:23] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 3 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807373 (Marostegui) p:Triage→Medium
[05:40:40] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807424 (Marostegui) @Jclark-ctr I am not able to reimage the host, it is not rebooting, can you check onsite what's on the screen? I've tried several times to reboot it...
[05:45:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:46:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:46:47] <wikibugs>	 (PS1) Marostegui: clouddb1019: Adding a note [puppet] - https://gerrit.wikimedia.org/r/1269841
[05:46:59] <wikibugs>	 (CR) Marostegui: "This is a noop - a note for future usage" [puppet] - https://gerrit.wikimedia.org/r/1269841 (owner: Marostegui)
[05:47:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:47:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:47:24] <wikibugs>	 (CR) Marostegui: [C:+2] clouddb1019: Adding a note [puppet] - https://gerrit.wikimedia.org/r/1269841 (owner: Marostegui)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T0600)
[06:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:26:51] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1019.eqiad.wmnet with OS trixie
[06:27:04] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11807473 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: -...
[06:37:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to stats hosts for Daniel Kinzler - https://phabricator.wikimedia.org/T422827#11807481 (daniel) The SSH key is the one that I also use for access to the deployment hosts, is that ok?
[07:00:06] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T0700)
[07:09:06] <logmsgbot>	 !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on gitlab1003.wikimedia.org with reason: Upgrade
[07:28:05] <wikibugs>	 (CR) Jelto: [C:+2] "that's a relatively new service from Nat and probably also just read-only but I have to double check with Nat." [dns] - https://gerrit.wikimedia.org/r/1269452 (https://phabricator.wikimedia.org/T422819) (owner: Jelto)
[07:29:21] <logmsgbot>	 !log jelto@dns1004 START - running authdns-update
[07:30:47] <logmsgbot>	 !log jelto@dns1004 END - running authdns-update
[07:50:16] <wikibugs>	 SRE, SRE-swift-storage, Ceph, ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11807564 (MatthewVernon) I've eyeballed the discussion here - AFAICT apus is behaving as expected? I have...
[07:57:36] <wikibugs>	 (CR) Brouberol: [C:+1] growthbook: Add API key placeholders for automation [deployment-charts] - https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) (owner: Ryan Kemper)
[07:58:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:09:14] <wikibugs>	 (PS1) Jelto: gitlab::rsync: add misssing ensure [puppet] - https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858)
[08:12:47] <wikibugs>	 (PS8) Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[08:12:47] <wikibugs>	 (PS7) Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475)
[08:13:09] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8405/console" [puppet] - https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) (owner: Jelto)
[08:13:14] <wikibugs>	 (CR) Elukey: Move linting to Ruff and apply code fixes (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[08:14:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:14:10] <wikibugs>	 (CR) Brouberol: [C:+1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1266959 (owner: Muehlenhoff)
[08:14:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:16:07] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab::rsync: add misssing ensure [puppet] - https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) (owner: Jelto)
[08:16:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:17:11] <wikibugs>	 (PS1) Elukey: aux-k8s-services: update Jaeger's Istio DR after k8s upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1269868 (https://phabricator.wikimedia.org/T414486)
[08:19:25] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:19:48] <wikibugs>	 (CR) CI reject: [V:-1] tox: rework venvs to speed up local and CI timings [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[08:21:14] <wikibugs>	 (CR) CI reject: [V:-1] Move linting to Ruff and apply code fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[08:22:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:24:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:25:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:27:37] <wikibugs>	 (PS1) Tiziano Fogli: thanos/compact: reduce concurrency due to disk constraints [puppet] - https://gerrit.wikimedia.org/r/1269955 (https://phabricator.wikimedia.org/T386911)
[08:29:02] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] thanos/compact: reduce concurrency due to disk constraints [puppet] - https://gerrit.wikimedia.org/r/1269955 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[08:32:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:32:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:33:21] <wikibugs>	 (CR) Clément Goubert: "One last small change and then lgtm." [puppet] - https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: Bking)
[08:33:48] <wikibugs>	 (PS1) MVernon: apus: add two new storage nodes in codfw [puppet] - https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902)
[08:34:12] <wikibugs>	 (PS2) MVernon: apus: add two new storage nodes in codfw [puppet] - https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902)
[08:34:15] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:34:49] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) (owner: MVernon)
[08:38:32] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11807714 (MatthewVernon) Thanks @Jhancock.wm :)
[08:41:11] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "The hostnames match the description and the related task. LGTM." [puppet] - https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) (owner: MVernon)
[08:42:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:44:32] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:44:56] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:45:00] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:45:13] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:45:16] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:45:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventgate-main.svc.codfw.wmnet:4492 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:50:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventgate-main.svc.codfw.wmnet:4492 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:54:14] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:54:18] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:54:31] <wikibugs>	 (CR) Hashar: "That one failed due to T422907 , I have reverted Vector patch https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1269962" [core] (wmf/next) - https://gerrit.wikimedia.org/r/1269755 (owner: TrainBranchBot)
[08:54:33] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:54:40] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:55:00] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[09:00:24] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:01:29] <wikibugs>	 (PS9) Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[09:01:29] <wikibugs>	 (PS8) Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475)
[09:01:29] <wikibugs>	 (PS1) Elukey: tests: fix icinga tests when running on py3.13 [software/spicerack] - https://gerrit.wikimedia.org/r/1269968
[09:02:21] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:02:58] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:04:41] <wikibugs>	 (PS1) Federico Ceratto: admin: Add second U2F key, remove non-U2F SSH key [puppet] - https://gerrit.wikimedia.org/r/1269970
[09:04:41] <wikibugs>	 (CR) Federico Ceratto: "As discussed on IRC" [puppet] - https://gerrit.wikimedia.org/r/1269970 (owner: Federico Ceratto)
[09:04:46] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:06:17] <wikibugs>	 (CR) JMeybohm: [C:+1] aux-k8s-services: update Jaeger's Istio DR after k8s upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1269868 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[09:07:52] <wikibugs>	 (CR) CI reject: [V:-1] tests: fix icinga tests when running on py3.13 [software/spicerack] - https://gerrit.wikimedia.org/r/1269968 (owner: Elukey)
[09:13:02] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:14:49] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:15:22] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:16:25] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[09:16:59] <wikibugs>	 (CR) Clément Goubert: [C:+2] Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - https://gerrit.wikimedia.org/r/1259942 (owner: Clément Goubert)
[09:17:12] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:17:14] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2147 (T419635)', diff saved to https://phabricator.wikimedia.org/P90346 and previous config saved to /var/cache/conftool/dbconfig/20260410-091713-fceratto.json
[09:17:17] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:18:06] <wikibugs>	 (PS2) Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804)
[09:18:34] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[09:19:01] <wikibugs>	 (Merged) jenkins-bot: Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - https://gerrit.wikimedia.org/r/1259942 (owner: Clément Goubert)
[09:19:04] <wikibugs>	 (CR) Elukey: [C:+2] aux-k8s-services: update Jaeger's Istio DR after k8s upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1269868 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[09:21:13] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[09:21:22] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[09:21:32] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[09:21:48] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[09:22:06] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[09:22:21] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[09:24:19] <wikibugs>	 (PS1) Andrew McAllister (WMDE): Allow WMDE Airflow instance to egress to dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1269978 (https://phabricator.wikimedia.org/T414583)
[09:24:21] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: sync
[09:24:27] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: sync
[09:24:28] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:24:37] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: sync
[09:24:46] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: sync
[09:24:47] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:26:04] <wikibugs>	 (PS2) Elukey: tests: fix icinga tests when running on py3.13 [software/spicerack] - https://gerrit.wikimedia.org/r/1269968
[09:26:04] <wikibugs>	 (PS10) Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[09:26:04] <wikibugs>	 (PS9) Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475)
[09:29:38] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[09:29:48] <wikibugs>	 (CR) CI reject: [V:-1] Move linting to Ruff and apply code fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[09:30:11] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[09:30:38] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 (owner: 10Elukey)
[09:31:09] <wikibugs>	 (03CR) 10Volans: Move linting to Ruff and apply code fixes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[09:31:32] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[09:34:14] <wikibugs>	 (03CR) 10Elukey: [C:03+2] tests: fix icinga tests when running on py3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269968 (owner: 10Elukey)
[09:34:46] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[09:35:56] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:36:03] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:39:49] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[09:41:27] <wikibugs>	 (03PS3) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730)
[09:42:53] <wikibugs>	 (03Merged) 10jenkins-bot: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[09:43:02] <wikibugs>	 (03CR) 10Atsuko: "I addressed the comments and also elaborated on the decision for not adding the script for default deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko)
[09:45:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:48:06] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[09:48:20] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[09:50:06] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[09:53:10] <wikibugs>	 06SRE: Add atsuko to icinga - https://phabricator.wikimedia.org/T422921 (10atsuko) 03NEW
[09:54:33] <wikibugs>	 (03PS1) 10Volans: reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986
[09:55:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:21] <wikibugs>	 (03PS1) 10Atsuko: icinga: add Atsuko Ito to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1269987 (https://phabricator.wikimedia.org/T422921)
[09:57:54] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[09:58:00] <wikibugs>	 (03PS1) 10Fabfur: cache::aptrepo: restore haproxy28 component and update [puppet] - 10https://gerrit.wikimedia.org/r/1269988
[09:58:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:00:29] <wikibugs>	 (03CR) 10Elukey: [C:03+1] reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 (owner: 10Volans)
[10:01:20] <wikibugs>	 (03CR) 10Volans: [C:03+2] reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 (owner: 10Volans)
[10:02:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:28] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T422668
[10:04:43] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Sustainability (Incident Followup): lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#11808033 (10MLechvien-WMF) a:03Ladsgroup @Ladsgroup  This is not an Active Investigation and qualifies more for an Follow up Action...
[10:05:51] <wikibugs>	 (03PS3) 10Majavah: hieradata: Enable paging for dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268979
[10:07:51] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:08:03] <wikibugs>	 (03Merged) 10jenkins-bot: reposync: fix unit test [software/spicerack] - 10https://gerrit.wikimedia.org/r/1269986 (owner: 10Volans)
[10:09:40] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:09:47] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:10:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:10:23] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:10:36] <wikibugs>	 (03PS2) 10Vgutierrez: cache::aptrepo: restore haproxy28 component and update [puppet] - 10https://gerrit.wikimedia.org/r/1269988 (https://phabricator.wikimedia.org/T422926) (owner: 10Fabfur)
[10:10:55] <wikibugs>	 (03PS1) 10Elukey: haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992
[10:11:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:11:16] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:11:36] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:12:22] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] cache::aptrepo: restore haproxy28 component and update [puppet] - 10https://gerrit.wikimedia.org/r/1269988 (https://phabricator.wikimedia.org/T422926) (owner: 10Fabfur)
[10:13:45] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "LG!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269978 (https://phabricator.wikimedia.org/T414583) (owner: 10Andrew McAllister (WMDE))
[10:14:29] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:16:18] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:16:24] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:18:14] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:18:16] <wikibugs>	 (03CR) 10Elukey: "$ docker run -ti docker-registry.wikimedia.org/haproxy:3.2.15-1 -vv" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (owner: 10Elukey)
[10:19:06] <vgutierrez>	 !log upload haproxy 2.8.20 to thirdparty/haproxy28 for bookworm-wikimedia (apt.wm.o) - T422926
[10:19:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:10] <stashbot>	 T422926: Thumbor is using an unmantained HAProxy version - https://phabricator.wikimedia.org/T422926
[10:19:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] icinga: add Atsuko Ito to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1269987 (https://phabricator.wikimedia.org/T422921) (owner: 10Atsuko)
[10:20:14] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:20:22] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:21:13] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[10:25:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:27:32] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:27:39] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:30:00] <wikibugs>	 (03PS2) 10Clément Goubert: haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey)
[10:30:04] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:31:31] <wikibugs>	 (03CR) 10Brouberol: airflow: dag filter helper function (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko)
[10:35:32] <wikibugs>	 (03PS1) 10Elukey: istio: revisit Prometheus buckets for ML's gateway/sidecar sources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886)
[10:37:22] <wikibugs>	 (03CR) 10Elukey: "Kicked off the conversation with some high level values, lemme know if you want to change them further. My goal is to have a standard that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey)
[10:43:50] <wikibugs>	 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11808224 (10FCeratto-WMF) That's pretty much the issue to discuss: we have only very few warnings on IRC (not pages) as datapoints and no immediate way to simulate...
[10:53:48] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "from a traffic PoV this makes more sense than maintaining a 2.8 version with bookworm given that we get rid of OpenSSL 3.0 in favor of 3.5" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey)
[10:58:12] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:00:02] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T0700)
[11:00:05] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260410T1100).
[11:00:38] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:01:32] <logmsgbot>	 aokoth@cumin1003 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[11:02:29] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:02:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[11:03:32] <jinxer-wm>	 FIRING: SLOBudgetBurn: Search update lag is below 95% target in codfw   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:04:41] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[11:04:59] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[11:06:36] <wikibugs>	 (03CR) 10Atsuko: "all addressed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko)
[11:06:55] <wikibugs>	 (03PS4) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730)
[11:07:01] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:08:03] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116992 bytes in 2.368 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:08:32] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Search update lag is below 95% target in codfw   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:10:47] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release - T422668
[11:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:54] <wikibugs>	 (03PS5) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730)
[11:13:37] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nicely done!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko)
[11:14:59] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] icinga: add Atsuko Ito to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1269987 (https://phabricator.wikimedia.org/T422921) (owner: 10Atsuko)
[11:15:02] <jinxer-wm>	 FIRING: [3x] SLOBudgetBurn: Search update lag is below 95% target in codfw   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:15:05] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:16:52] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko)
[11:16:53] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:16:55] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T419635)', diff saved to https://phabricator.wikimedia.org/P90347 and previous config saved to /var/cache/conftool/dbconfig/20260410-111654-fceratto.json
[11:16:58] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:19:08] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:19:40] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko)
[11:20:02] <jinxer-wm>	 FIRING: [5x] SLOBudgetBurn: Search update lag is below 95% target in codfw   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:20:59] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:22:54] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[11:25:02] <jinxer-wm>	 FIRING: [7x] SLOBudgetBurn: Search update lag is below 95% target in codfw   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:27:43] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P90348 and previous config saved to /var/cache/conftool/dbconfig/20260410-112742-fceratto.json
[11:30:02] <jinxer-wm>	 RESOLVED: [7x] SLOBudgetBurn: Search update lag is below 95% target in codfw   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:30:57] <wikibugs>	 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11808307 (10NakavoDev) >>! In T422872#11806792, @Reedy wrote: > By you extract one of the links... What do you mean? >  > Are you always getting thumbs? Or are you sometimes (often?) requesting the originals based on siz...
[11:33:40] <wikibugs>	 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11808312 (10Marostegui) I'm on call next week so we can force a lag page
[11:37:42] <wikibugs>	 (03PS1) 10Aude: Opt-in new accounts to ReadingLists beta feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833)
[11:38:32] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P90349 and previous config saved to /var/cache/conftool/dbconfig/20260410-113830-fceratto.json
[11:40:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude)
[11:49:21] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T419635)', diff saved to https://phabricator.wikimedia.org/P90350 and previous config saved to /var/cache/conftool/dbconfig/20260410-114919-fceratto.json
[11:49:25] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:49:28] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[11:50:12] <wikibugs>	 06SRE, 10Observability-Alerting, 07Sustainability (Incident Followup): Paging alert on combination of hosts down and a BGP outage - https://phabricator.wikimedia.org/T417051#11808348 (10MLechvien-WMF) tagging #sre_observability as I'm not sure the Alerting tag is used
[11:50:17] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2155 (T419635)', diff saved to https://phabricator.wikimedia.org/P90351 and previous config saved to /var/cache/conftool/dbconfig/20260410-115015-fceratto.json
[11:58:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:09:18] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:15:17] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:22:14] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:24:08] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:24:16] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:25:45] <wikibugs>	 06SRE: Add atsuko to icinga - https://phabricator.wikimedia.org/T422921#11808463 (10atsuko) 05Open→03In progress
[12:26:10] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:27:38] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:32:24] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:33:38] <wikibugs>	 06SRE: Add atsuko to icinga - https://phabricator.wikimedia.org/T422921#11808476 (10atsuko) 05In progress→03Resolved puppet changes: merged puppet-private changes: `8af215ff3c0c08599a9e52ff5855d197a835c418`
[12:34:16] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:37:06] <wikibugs>	 (03PS1) 10Brouberol: growthbook-next: deploy an image containing an unreleased feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270028 (https://phabricator.wikimedia.org/T420781)
[12:39:41] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] growthbook-next: deploy an image containing an unreleased feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270028 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol)
[12:41:35] <wikibugs>	 (03Merged) 10jenkins-bot: growthbook-next: deploy an image containing an unreleased feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270028 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol)
[12:44:42] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:46:32] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:47:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[12:51:56] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:52:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply
[12:54:31] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:54:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:55:37] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:57:13] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:57:26] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:57:33] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:57:46] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[12:57:46] <wikibugs>	 (03PS1) 10Ladsgroup: envoy: Close connections to swift after 10s of inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872)
[12:59:23] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:01:40] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove dns for decom lumen transport cct - cmooney@cumin1003"
[13:01:45] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove include for 2620:0:861:fe06::/64 link range [dns] - 10https://gerrit.wikimedia.org/r/1270032 (https://phabricator.wikimedia.org/T395878)
[13:02:06] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove dns for decom lumen transport cct - cmooney@cumin1003"
[13:02:06] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:02:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:43] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox
[13:06:36] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove include for 2620:0:861:fe06::/64 link range [dns] - 10https://gerrit.wikimedia.org/r/1270032 (https://phabricator.wikimedia.org/T395878) (owner: 10Cathal Mooney)
[13:06:53] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[13:07:11] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[13:08:15] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[13:12:04] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] envoy: Close connections to swift after 10s of inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup)
[13:15:10] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:16:59] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:19:50] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:19:53] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:20:11] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:21:12] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[13:21:20] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T410589)', diff saved to https://phabricator.wikimedia.org/P90357 and previous config saved to /var/cache/conftool/dbconfig/20260410-132119-ladsgroup.json
[13:21:23] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[13:21:41] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:22:08] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[13:22:16] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T410589)', diff saved to https://phabricator.wikimedia.org/P90358 and previous config saved to /var/cache/conftool/dbconfig/20260410-132215-ladsgroup.json
[13:28:21] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:29:54] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:30:11] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:30:12] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:30:24] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:30:28] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:32:16] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:32:21] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:32:36] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:36:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "I don't feel strongly either way tbh, I'd say let's go for it for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268979 (owner: 10Majavah)
[13:36:38] <wikibugs>	 (03PS10) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877)
[13:37:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11808649 (10ssingh)
[13:40:56] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:44:35] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:44:40] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[13:49:36] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T419635)', diff saved to https://phabricator.wikimedia.org/P90362 and previous config saved to /var/cache/conftool/dbconfig/20260410-134935-fceratto.json
[13:49:40] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:00:25] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P90363 and previous config saved to /var/cache/conftool/dbconfig/20260410-140023-fceratto.json
[14:01:41] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270038
[14:08:12] <wikibugs>	 (03CR) 10Bking: [C:03+2] cloudelastic: Prepare for opensearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[14:10:21] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:10:22] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:10:56] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:11:13] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P90365 and previous config saved to /var/cache/conftool/dbconfig/20260410-141112-fceratto.json
[14:11:27] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:11:50] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:12:18] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:13:08] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:13:27] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:13:46] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:13:58] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:17:51] <wikibugs>	 (03CR) 10JMeybohm: "Feel free to also loop in the friendly folks from netops ( @cmooney@wikimedia.org || @ayounsi@wikimedia.org) to double check if these quer" [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[14:19:09] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270038 (owner: 10Gkyziridis)
[14:19:40] <wikibugs>	 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11808793 (10Reedy) Do you know what stats of thumbnail vs original you’re requesting?  Generally, thumbnails are definitely preferred, so if you’re preferring original because it’s first match, that will start to explain...
[14:21:23] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270038 (owner: 10Gkyziridis)
[14:22:02] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T419635)', diff saved to https://phabricator.wikimedia.org/P90366 and previous config saved to /var/cache/conftool/dbconfig/20260410-142200-fceratto.json
[14:22:05] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:22:21] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:23:09] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90367 and previous config saved to /var/cache/conftool/dbconfig/20260410-142308-fceratto.json
[14:25:09] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:27:00] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:30:40] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:30:48] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:32:29] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:35:27] <wikibugs>	 (03PS1) 10Aude: Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885)
[14:35:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[14:36:24] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[14:36:30] <wikibugs>	 (03PS11) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877)
[14:38:33] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:38:41] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:38:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[14:40:17] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:40:52] <wikibugs>	 (03PS12) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877)
[14:41:27] <wikibugs>	 (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO)
[14:44:02] <wikibugs>	 (03CR) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[14:45:25] <wikibugs>	 06SRE-OnFire, 10corto: Cortobot help command should not spam the main channel - https://phabricator.wikimedia.org/T421858#11808871 (10Peachey88)
[14:45:26] <wikibugs>	 06SRE-OnFire, 10corto: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#11808872 (10Peachey88)
[14:45:28] <wikibugs>	 06SRE-OnFire, 10corto, 10Incident Tooling: corto: track responders - https://phabricator.wikimedia.org/T391897#11808873 (10Peachey88)
[14:45:31] <wikibugs>	 06SRE-OnFire, 10corto, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#11808874 (10Peachey88)
[14:45:32] <wikibugs>	 06SRE-OnFire, 10corto, 10Incident Tooling: Corto: Functional & Integration testing - https://phabricator.wikimedia.org/T377036#11808875 (10Peachey88)
[14:45:58] <wikibugs>	 (03PS13) 10Blake: kubernetes-generic: Add alerts for BGP failure scenarios. [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877)
[14:48:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stats hosts for Daniel Kinzler - https://phabricator.wikimedia.org/T422827#11808892 (10daniel) 05Open→03Invalid Turns out I'm already in the relevant LDAP group. I filed a separate ticket for the Kerberos credentials: T422947
[14:53:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:54:15] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[14:57:42] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11808913 (10taavi) 05Open→03Resolved a:03ABran-WMF The patch is merged...
[15:00:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:05:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:08:11] <wikibugs>	 (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO)
[15:12:11] <wikibugs>	 (03CR) 10Anzx: [C:04-1] "create new patch for deleting logo files and schedule it a week later, because some files may be cached it would be safe delete after a fe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO)
[15:14:12] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11808965 (10Scott_French) >>! In T422166#11807564, @MatthewVernon wrote: > I've eyeballed the discussion he...
[15:16:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie
[15:17:14] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephmon1006.eqiad.wmnet
[15:20:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:23:57] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon1006.eqiad.wmnet
[15:26:57] <wikibugs>	 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11808983 (10NakavoDev) >>! In T422872#11808793, @Reedy wrote: > Do you know what stats of thumbnail vs original you’re requesting? >  > Generally, thumbnails are definitely preferred, so if you’re preferring original bec...
[15:27:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[15:29:00] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gitlab::rsync: add misssing ensure [puppet] - 10https://gerrit.wikimedia.org/r/1269865 (https://phabricator.wikimedia.org/T422858) (owner: 10Jelto)
[15:34:16] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[15:34:49] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:35:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:36:25] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:38:37] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:38:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:39:28] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephmon1006.eqiad.wmnet
[15:41:12] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1012.eqiad.wmnet with OS trixie
[15:44:57] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:45:03] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:45:57] <wikibugs>	 (03PS1) 10Scott French: wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455)
[15:46:10] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon1006.eqiad.wmnet
[15:49:07] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "Overall LGTM, few comments in-line." [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake)
[15:49:16] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French)
[15:55:22] <wikibugs>	 (03CR) 10Stoyofuku-wmf: [C:03+1] "I never learned what our timestamp format is - is that 1pm UTC?  Either way, this looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude)
[15:58:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:01] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:59:06] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[16:01:13] <wikibugs>	 (03CR) 10Herron: [C:03+1] smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: 10Cwhite)
[16:05:34] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox
[16:07:24] <wikibugs>	 (03CR) 10Aude: "yes it is 1pm UTC which is start of the backport window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude)
[16:08:28] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan)
[16:09:10] <wikibugs>	 (03CR) 10Herron: [C:03+1] "nice!" [alerts] - 10https://gerrit.wikimedia.org/r/1269673 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli)
[16:09:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:27:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon2004-dev - https://phabricator.wikimedia.org/T422437#11809098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:27:27] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90368 and previous config saved to /var/cache/conftool/dbconfig/20260410-162726-fceratto.json
[16:27:33] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:29:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:29:58] <wikibugs>	 (03CR) 10Scott French: [C:03+2] wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French)
[16:34:15] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:37:34] <wikibugs>	 (03Merged) 10jenkins-bot: wikikube: Remove comment on coredns replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270056 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French)
[16:38:15] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P90370 and previous config saved to /var/cache/conftool/dbconfig/20260410-163814-fceratto.json
[16:39:33] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421398
[16:39:36] <stashbot>	 T421398: SystemdUnitFailed - zuul-executor - https://phabricator.wikimedia.org/T421398
[16:49:03] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P90371 and previous config saved to /var/cache/conftool/dbconfig/20260410-164902-fceratto.json
[16:51:29] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: 10Cwhite)
[16:52:20] <wikibugs>	 (03PS1) 10Bking: cirrussearch: move cloudelastic1012 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1270061 (https://phabricator.wikimedia.org/T422860)
[16:54:41] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: move cloudelastic1012 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1270061 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[16:57:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie
[16:59:52] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90372 and previous config saved to /var/cache/conftool/dbconfig/20260410-165951-fceratto.json
[16:59:56] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[17:00:11] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance
[17:03:19] <wikibugs>	 (03CR) 10Bernard Wang: [C:03+1] Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[17:08:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[17:11:38] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Sustainability (Incident Followup): lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#11809235 (10Ladsgroup) I honestly think this is a discussion for sre-collab team since they own mailman now.
[17:12:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[17:26:11] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.248.0" for 2 host(s)
[17:27:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS trixie
[17:28:02] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.248.0" completed for 2 hosts
[17:41:10] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:46:14] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:51:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:54:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 07Essential-Work: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11809352 (10JMoore-WMF) hi- i'm unable to access https://superset.wikimedia.org/superset/dashboard/409/?native_filters_key...
[17:55:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:57:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:00:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:00:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 07Essential-Work: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11809441 (10mpopov) I suspect Justin is seeing the same error as me:  ` Error: {'message': 'Permission denied: user=bearlo...
[18:01:52] <wikibugs>	 (03CR) 10Dwisehaupt: [V:03+1] "@adenisse@wikimedia.org @kherron@wikimedia.org Adding you two in as reviewers for this. As stated in T422888, I'm not 100% certain that sr" [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt)
[18:06:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:06:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:09:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:09:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:11:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:11:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:18:48] <wikibugs>	 (03PS1) 10Zabe: NewFilesPager: Make sure filerevision is queried before file [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270068 (https://phabricator.wikimedia.org/T422946)
[18:23:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:26:59] <logmsgbot>	 !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@46eae53] (releasing): (no justification provided)
[18:27:53] <logmsgbot>	 !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@46eae53] (releasing): (no justification provided) (duration: 00m 56s)
[18:34:08] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance
[18:34:56] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2206 (T419635)', diff saved to https://phabricator.wikimedia.org/P90373 and previous config saved to /var/cache/conftool/dbconfig/20260410-183455-fceratto.json
[18:35:00] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[18:39:13] <wikibugs>	 (03PS1) 10Bking: opensearch: move cloudelastic1012 back into prod role [puppet] - 10https://gerrit.wikimedia.org/r/1270071 (https://phabricator.wikimedia.org/T422860)
[18:40:26] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch: move cloudelastic1012 back into prod role [puppet] - 10https://gerrit.wikimedia.org/r/1270071 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[19:25:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:26:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:26:46] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[19:28:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:31:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:32:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:34:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:35:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:37:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:40:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:43:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:44:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11809628 (10VRiley-WMF) Finally was able to get Dell to send out a new part for the unit. Part should arrive next business day.
[19:47:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:47:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:47:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:48:25] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:48:30] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[19:50:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:51:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:52:23] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:55:32] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:56:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:56:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:57:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:57:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:57:55] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:58:38] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:02:54] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:05:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11809647 (VRiley-WMF) I seem to be running into the same issue that @Jhancock.wm is running into with T418899. Awaiting to see what the fix would be.
[20:21:00] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T419635)', diff saved to https://phabricator.wikimedia.org/P90376 and previous config saved to /var/cache/conftool/dbconfig/20260410-202059-fceratto.json
[20:21:06] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[20:30:43] <wikibugs>	 (PS1) Bking: cloudelastic: remove logstash profile [puppet] - https://gerrit.wikimedia.org/r/1270082 (https://phabricator.wikimedia.org/T324335)
[20:31:49] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P90377 and previous config saved to /var/cache/conftool/dbconfig/20260410-203147-fceratto.json
[20:31:56] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270082 (https://phabricator.wikimedia.org/T324335) (owner: Bking)
[20:40:12] <wikibugs>	 (CR) Bking: [C:+2] cloudelastic: remove logstash profile [puppet] - https://gerrit.wikimedia.org/r/1270082 (https://phabricator.wikimedia.org/T324335) (owner: Bking)
[20:42:38] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P90378 and previous config saved to /var/cache/conftool/dbconfig/20260410-204236-fceratto.json
[20:48:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: opensearch-disable-readahead-cloudelastic-chi-eqiad.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: nginx.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:25] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T419635)', diff saved to https://phabricator.wikimedia.org/P90380 and previous config saved to /var/cache/conftool/dbconfig/20260410-205324-fceratto.json
[20:53:29] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[20:53:33] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance
[20:54:21] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2210 (T419635)', diff saved to https://phabricator.wikimedia.org/P90381 and previous config saved to /var/cache/conftool/dbconfig/20260410-205420-fceratto.json
[20:56:45] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[20:57:47] <inflatador>	 ^^ known
[20:57:59] <inflatador>	 Will set a suppression for this guy
[20:59:26] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cloudelastic1012.eqiad.wmnet with reason: still fixing Puppet
[21:17:55] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:17:55] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:17:55] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:17:56] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:17:57] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:19:25] <inflatador>	 ^^ known, should clear shortly
[21:20:55] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:20:55] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:20:55] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:20:56] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 85 days) https://wikitech.wikimedia.org/wiki/Search
[21:24:44] <wikibugs>	 (PS1) Bking: nginx tls proxy: remove defunct directive [puppet] - https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860)
[21:25:08] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[21:33:37] <wikibugs>	 (CR) Ryan Kemper: [C:+1] nginx tls proxy: remove defunct directive [puppet] - https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[21:37:59] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie
[21:38:16] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11809821 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[21:52:19] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11809846 (Jclark-ctr) I have not had any luck with getting it to power on.   I will start Monday with pulling parts from decom servers to try to get it back up.
[21:52:53] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 27.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:07] <wikibugs>	 SRE, Infrastructure-Foundations: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11809847 (Jdforrester-WMF) Open→Resolved a:Jdforrester-WMF Yup, we're using it in production!
[21:53:11] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on db1155 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:53:59] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T410589)', diff saved to https://phabricator.wikimedia.org/P90382 and previous config saved to /var/cache/conftool/dbconfig/20260410-215358-ladsgroup.json
[21:54:02] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[21:54:42] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:57:11] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:58:23] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:00:24] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:02:35] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:04:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P90383 and previous config saved to /var/cache/conftool/dbconfig/20260410-220406-ladsgroup.json
[22:07:25] <wikibugs>	 (CR) Cwhite: [C:+2] add beta-logs pki key [labs/private] - https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[22:07:34] <wikibugs>	 (CR) Cwhite: [V:+2 C:+2] add beta-logs pki key [labs/private] - https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[22:08:53] <wikibugs>	 (CR) Cwhite: [C:+2] initial pki config for beta-logs env [puppet] - https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[22:10:28] <logmsgbot>	 jclark@cumin1003 provision (PID 1655695) is awaiting input
[22:13:52] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:14:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P90384 and previous config saved to /var/cache/conftool/dbconfig/20260410-221414-ladsgroup.json
[22:17:49] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host phab1006.eqiad.wmnet with OS trixie
[22:24:22] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T410589)', diff saved to https://phabricator.wikimedia.org/P90385 and previous config saved to /var/cache/conftool/dbconfig/20260410-222421-ladsgroup.json
[22:24:25] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[22:24:38] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[22:24:46] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T410589)', diff saved to https://phabricator.wikimedia.org/P90386 and previous config saved to /var/cache/conftool/dbconfig/20260410-222445-ladsgroup.json
[22:26:27] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11809888 (Jclark-ctr) @Dzahn  next week can you update puppet this is a uefi only server
[22:28:15] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab1006.eqiad.wmnet with OS trixie
[22:30:08] <wikibugs>	 (PS1) Cwhite: logging: add dummy pki "secrets" [labs/private] - https://gerrit.wikimedia.org/r/1270089 (https://phabricator.wikimedia.org/T350516)
[22:30:43] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1005.eqiad.wmnet with OS bookworm
[22:30:56] <wikibugs>	 (CR) Cwhite: [V:+2 C:+2] logging: add dummy pki "secrets" [labs/private] - https://gerrit.wikimedia.org/r/1270089 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[22:31:03] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1005.eqiad.wmnet with OS bookworm
[22:31:59] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1005.eqiad.wmnet with OS bookworm
[22:32:17] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1005.eqiad.wmnet with OS bookworm
[22:33:20] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1005.eqiad.wmnet with OS bookworm
[22:34:00] <wikibugs>	 SRE, Pywikibot, Traffic, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11809908 (MisterSynergy) >>! In T421642#11785461, @Xqt wrote: > The problems began on March 25th: > {F74901675}  Exact timestamp seems to be shortly after 2026-03-25 1...
[22:40:09] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T419635)', diff saved to https://phabricator.wikimedia.org/P90387 and previous config saved to /var/cache/conftool/dbconfig/20260410-224008-fceratto.json
[22:40:12] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[22:46:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:50:57] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P90388 and previous config saved to /var/cache/conftool/dbconfig/20260410-225055-fceratto.json
[22:52:17] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:55:22] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-be1005.eqiad.wmnet with reason: host reimage
[22:59:53] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be1005.eqiad.wmnet with reason: host reimage
[23:01:44] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P90389 and previous config saved to /var/cache/conftool/dbconfig/20260410-230143-fceratto.json
[23:02:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:05:04] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host apus-be1006.eqiad.wmnet with OS bookworm
[23:12:32] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T419635)', diff saved to https://phabricator.wikimedia.org/P90390 and previous config saved to /var/cache/conftool/dbconfig/20260410-231231-fceratto.json
[23:12:36] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[23:12:51] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance
[23:13:38] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2219 (T419635)', diff saved to https://phabricator.wikimedia.org/P90391 and previous config saved to /var/cache/conftool/dbconfig/20260410-231337-fceratto.json
[23:16:43] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:17:11] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:17:13] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be1005.eqiad.wmnet with OS bookworm
[23:18:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:20:02] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-be1006.eqiad.wmnet with reason: host reimage
[23:24:08] <icinga-wm>	 PROBLEM - Host apus-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[23:25:23] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be1006.eqiad.wmnet with reason: host reimage
[23:26:34] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810040 (Dzahn) @Jclark-ctr Ok, but I'm not sure what kind of update it needs. I never had a UEFI-only server I think.
[23:27:39] <icinga-wm>	 RECOVERY - Host apus-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[23:29:42] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810051 (Jclark-ctr) currently preseed.yaml is   `    'pki*|phab*':     - partman/standard.cfg     - partman/raid1-2dev.cfg  ` for efi would need  `    - partman/standard.cf...
[23:33:05] <wikibugs>	 (PS1) Dzahn: zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895)
[23:33:57] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810070 (Jclark-ctr) @Dzahn   if you do make any edits please keep in mind of codfw has same server racked needing same update  T418899 phab2003
[23:39:58] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1270106
[23:39:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1270106 (owner: TrainBranchBot)
[23:40:24] <wikibugs>	 (PS1) Dzahn: installserver: set UEFI-only recipes for newer phab* hosts [puppet] - https://gerrit.wikimedia.org/r/1270107 (https://phabricator.wikimedia.org/T418905)
[23:40:48] <wikibugs>	 (PS2) Dzahn: installserver: set UEFI-only recipes for newer phab* hosts [puppet] - https://gerrit.wikimedia.org/r/1270107 (https://phabricator.wikimedia.org/T418905)
[23:41:33] <wikibugs>	 (CR) Dzahn: [C:+2] installserver: set UEFI-only recipes for newer phab* hosts [puppet] - https://gerrit.wikimedia.org/r/1270107 (https://phabricator.wikimedia.org/T418905) (owner: Dzahn)
[23:47:05] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:48:27] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops, Patch-For-Review: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11810160 (Dzahn) @Jclark-ctr Gotcha!  I did make the change. It should be for phab1006, phab2003 and future phab* servers with higher numbers. Done.
[23:48:36] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:49:33] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:49:35] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[23:49:37] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be1006.eqiad.wmnet with OS bookworm
[23:50:31] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895) (owner: Dzahn)
[23:50:36] <wikibugs>	 (PS2) Dzahn: zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895)
[23:51:17] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1270106 (owner: TrainBranchBot)
[23:51:28] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:53:46] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:54:54] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:54:58] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: mount /var/ssh/zuul for zuul-scheduler [puppet] - https://gerrit.wikimedia.org/r/1270103 (https://phabricator.wikimedia.org/T422895) (owner: Dzahn)