[00:02:25] (03Merged) 10jenkins-bot: Api: Remove deprecation warning for missing rvslots [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271085 (https://phabricator.wikimedia.org/T412637) (owner: 10Ladsgroup)
[00:07:40] RECOVERY - snapshot of s7 in eqiad on backupmon1001 is OK: Last snapshot for s7 at eqiad (db1171) taken on 2026-04-14 23:12:08 (748 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:10:13] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1271085|Api: Remove deprecation warning for missing rvslots (T412637)]], [[gerrit:1271086|Api: Remove deprecation warning for missing rvslots (T412637)]]
[00:10:21] T412637: Remove support for deprecated revisions without rvslots - https://phabricator.wikimedia.org/T412637
[00:12:05] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1271085|Api: Remove deprecation warning for missing rvslots (T412637)]], [[gerrit:1271086|Api: Remove deprecation warning for missing rvslots (T412637)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:13:02] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [00:16:55] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271085|Api: Remove deprecation warning for missing rvslots (T412637)]], [[gerrit:1271086|Api: Remove deprecation warning for missing rvslots (T412637)]] (duration: 06m 41s) [00:16:58] T412637: Remove support for deprecated revisions without rvslots - https://phabricator.wikimedia.org/T412637 [00:29:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T410589)', diff saved to https://phabricator.wikimedia.org/P90700 and previous config saved to /var/cache/conftool/dbconfig/20260415-002915-ladsgroup.json [00:29:20] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [00:39:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P90701 and previous config saved to /var/cache/conftool/dbconfig/20260415-003923-ladsgroup.json [00:49:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P90702 and previous config saved to /var/cache/conftool/dbconfig/20260415-004932-ladsgroup.json [00:59:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T410589)', diff saved to https://phabricator.wikimedia.org/P90703 and previous config saved to /var/cache/conftool/dbconfig/20260415-005940-ladsgroup.json [00:59:44] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [00:59:57] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance [01:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270605 (owner: 10TrainBranchBot) [01:00:05] !log ladsgroup@cumin1003 dbctl commit 
(dc=all): 'Depooling db2218 (T410589)', diff saved to https://phabricator.wikimedia.org/P90704 and previous config saved to /var/cache/conftool/dbconfig/20260415-010004-ladsgroup.json [01:08:57] (03PS2) 10Andrea Denisse: admin: Add passimacopoulos to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1271117 (https://phabricator.wikimedia.org/T423301) [01:09:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1271133 [01:09:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1271133 (owner: 10TrainBranchBot) [01:18:48] FIRING: KubernetesCalicoDown: wikikube-worker2280.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2280.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:20:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T410589)', diff saved to https://phabricator.wikimedia.org/P90705 and previous config saved to /var/cache/conftool/dbconfig/20260415-012048-ladsgroup.json [01:20:51] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1271133 (owner: 10TrainBranchBot) [01:20:58] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:30:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P90706 and previous config saved to /var/cache/conftool/dbconfig/20260415-013056-ladsgroup.json [01:38:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for - https://phabricator.wikimedia.org/T423301#11822541 (10andrea.denisse) Hi @Passimacopoulos, I added you to the `analytics-privatedata-users` group and created 
you a Kerberos principal.... [01:41:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P90707 and previous config saved to /var/cache/conftool/dbconfig/20260415-014104-ladsgroup.json [01:41:07] RESOLVED: ProbeDown: Service aqs1026-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1026-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:51:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T410589)', diff saved to https://phabricator.wikimedia.org/P90708 and previous config saved to /var/cache/conftool/dbconfig/20260415-015113-ladsgroup.json [01:51:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:51:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [01:51:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T410589)', diff saved to https://phabricator.wikimedia.org/P90709 and previous config saved to /var/cache/conftool/dbconfig/20260415-015138-ladsgroup.json [02:01:19] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:03:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:04:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked 
down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:05:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:07:34] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 14s) [02:08:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:13:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:15:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:16:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:17:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:20:16] PROBLEM 
- PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:25:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:26:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:29:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:30:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:31:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:31:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, 
wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:41:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:45:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:46:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:00:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:02:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:05:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:06:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:09:53] (03PS1) 10Codename Noreste: lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271206 (https://phabricator.wikimedia.org/T423102) [03:10:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:11:16] RECOVERY - PyBal 
backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:18:21] (03PS1) 10Codename Noreste: lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) [03:36:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste) [03:38:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271206 (https://phabricator.wikimedia.org/T423102) (owner: 10Codename Noreste) [03:45:16] (03CR) 10Ryan Kemper: [C:03+2] query_service: Add Prometheus metrics to deadlock remediation (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262510 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [03:48:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:46] PROBLEM - ganeti-noded running on ganeti2049 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [03:54:46] RECOVERY - ganeti-noded running on ganeti2049 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [03:55:25] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:55:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:23:08] (03PS1) 10MusikAnimal: Promote CodeMirror 6 out of beta and use in place of CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271263 (https://phabricator.wikimedia.org/T419332) [05:18:48] FIRING: KubernetesCalicoDown: wikikube-worker2280.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2280.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:30:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:33:49] (03CR) 10Marostegui: [C:03+1] "<3" [cookbooks] - 10https://gerrit.wikimedia.org/r/1270963 (owner: 10Federico Ceratto) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T0600) [06:13:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:15:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [06:16:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:19:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:21:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:21:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:22:50] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271448 [06:34:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:39:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:55:31] (03PS1) 10Muehlenhoff: Remove LDAP access for atitkov [puppet] - 10https://gerrit.wikimedia.org/r/1271470 [07:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:02:36] (03PS1) 10Ryan Kemper: opensearch: strip bundled plugins before WMF pkg [puppet] - 10https://gerrit.wikimedia.org/r/1271473 (https://phabricator.wikimedia.org/T423327) [07:03:48] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for atitkov [puppet] - 10https://gerrit.wikimedia.org/r/1271470 (owner: 10Muehlenhoff) [07:05:05] (03CR) 10Brouberol: [C:03+1] "Agreed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: 10Scott French) [07:08:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:09:22] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271473 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [07:09:26] (03Abandoned) 10Arnaudb: mailman: test httpd config before reloading [puppet] - 10https://gerrit.wikimedia.org/r/1270921 (https://phabricator.wikimedia.org/T323208) (owner: 10Arnaudb) [07:09:49] (03CR) 10Arnaudb: [C:03+1] "thanks for offering an alternative, this is indeed better! 
+1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208) (owner: 10Dzahn) [07:10:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [07:10:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [07:11:49] (03CR) 10Elukey: [C:03+1] move-vlan cookbook: add "inplace" support [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 (owner: 10Ayounsi) [07:12:28] (03CR) 10Elukey: [C:03+1] puppet: remove pyrra modules/profiles [puppet] - 10https://gerrit.wikimedia.org/r/1270996 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [07:12:52] (03CR) 10Elukey: [C:03+1] pyrra: remove pyrra/slo/slos dns entries [dns] - 10https://gerrit.wikimedia.org/r/1270995 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [07:13:29] (03CR) 10Elukey: [C:03+1] pyrra: remove configuration for web interface [puppet] - 10https://gerrit.wikimedia.org/r/1270992 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [07:14:30] (03CR) 10Elukey: [C:03+1] "I guess this comes before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1270996 right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1270974 (https://phabricator.wikimedia.org/T423307) (owner: 10Herron) [07:17:35] !log discard /srv/log/swift/server.log.1 on thanos-be2006 to free disk space [07:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:22:52] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nicely done" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [07:23:18] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:23:52] !log discard /srv/log/swift/server.log.5.gz on thanos-be2006 to free disk space [07:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2006.codfw.wmnet [07:25:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:26:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:26:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90710 and previous config saved to /var/cache/conftool/dbconfig/20260415-072626-fceratto.json [07:26:30] 
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:29:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90711 and previous config saved to /var/cache/conftool/dbconfig/20260415-072935-fceratto.json [07:32:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2006.codfw.wmnet [07:35:35] (03PS1) 10Muehlenhoff: Make cn=growthbook-customelevatedaccess managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1271488 (https://phabricator.wikimedia.org/T420688) [07:39:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P90712 and previous config saved to /var/cache/conftool/dbconfig/20260415-073942-fceratto.json [07:48:00] (03CR) 10Ayounsi: kubernetes-generic: Add alerts for BGP failure scenarios. (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1269994 (https://phabricator.wikimedia.org/T356877) (owner: 10Blake) [07:48:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:49:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P90713 and previous config saved to /var/cache/conftool/dbconfig/20260415-074951-fceratto.json [07:54:58] (03CR) 10Arnaudb: "I've been able to enable logging in journalctl with:" [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [07:55:25] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:55:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:00:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90714 and previous config saved to /var/cache/conftool/dbconfig/20260415-075959-fceratto.json [08:00:04] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:00:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:01:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [08:01:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90715 and previous config saved to /var/cache/conftool/dbconfig/20260415-080150-fceratto.json [08:04:13] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.depool: Do not require tmux/screen [cookbooks] - 10https://gerrit.wikimedia.org/r/1270963 (owner: 10Federico Ceratto) [08:04:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90716 and previous config saved to /var/cache/conftool/dbconfig/20260415-080458-fceratto.json [08:07:06] (03Merged) 10jenkins-bot: sre.mysql.depool: Do not require tmux/screen [cookbooks] - 10https://gerrit.wikimedia.org/r/1270963 (owner: 10Federico Ceratto) [08:08:18] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:10:47] (03PS1) 10Elukey: role::cluster::management: add profile to sync firmwares [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) [08:11:21] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:13:28] (03CR) 10Arnaudb: "I think it will be harder to debug if we don't write to a greppable logfile, each access generates about 30 lines with `level=debug` for s" [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [08:15:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P90717 and previous config saved to /var/cache/conftool/dbconfig/20260415-081506-fceratto.json [08:17:04] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! 
Applies fine locally" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1270617 (owner: 10Pppery) [08:21:22] (03CR) 10Arnaudb: "Ah, mybad, it writes in `/var/log/envoy/syslog.log`" [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [08:22:42] (03CR) 10Nikerabbit: [C:03+1] Register ArticleGuidance extension and enable in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295) (owner: 10Sbisson) [08:23:19] (03Abandoned) 10Arnaudb: gerrit: access logging with Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1270951 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [08:25:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P90718 and previous config saved to /var/cache/conftool/dbconfig/20260415-082514-fceratto.json [08:25:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:34:01] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2069 [08:35:06] (03PS1) 10Jelto: miscweb: add config environment variables to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271580 (https://phabricator.wikimedia.org/T414405) [08:35:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T419635)', diff saved to https://phabricator.wikimedia.org/P90719 and previous config saved to /var/cache/conftool/dbconfig/20260415-083522-fceratto.json [08:35:27] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:35:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [08:35:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90720 and previous config saved to /var/cache/conftool/dbconfig/20260415-083547-fceratto.json [08:38:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:38:35] (03CR) 10Volans: [C:03+1] "LGTM, just be aware that UX wise with a single env that runs multiple commands it could be confusing for some." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:38:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90721 and previous config saved to /var/cache/conftool/dbconfig/20260415-083857-fceratto.json [08:39:02] (03CR) 10Elukey: [C:03+2] tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:39:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:40:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:40:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:41:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:43:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:43:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [08:44:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:48:19] Hi all! I have a config change to deploy, see . Since the afternoon backport window is pretty full, I'd like to do it now. Any objections? @Amir1? @urbanecm? 
[08:49:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P90722 and previous config saved to /var/cache/conftool/dbconfig/20260415-084904-fceratto.json [08:50:48] Emperor: any objections? --^ [08:57:44] ? [08:58:20] (sorry, I'm knee-deep in other debugging right now) [08:59:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P90723 and previous config saved to /var/cache/conftool/dbconfig/20260415-085912-fceratto.json [08:59:20] Emperor: just asking if it would be ok to deploy a config change right now [09:01:03] pass, sorry [09:02:35] Emperor: I didn't mean to ask you to do it, I can do it myself. Just making sure I'm not interfering with anything. But if now isn't a good time I can also do it tomorrow. [09:04:45] I'm afraid the answer is I still don't know - I'm too busy in other things to have been paying attention to IRC this morning, and I'm neither on-call nor clinic duty ATM [09:04:46] (03PS1) 10Fabfur: cache::haproxy: small fix in contact info regex [puppet] - 10https://gerrit.wikimedia.org/r/1271593 [09:05:48] (03PS1) 10Elukey: Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) [09:05:54] 06SRE-OnFire, 06Release-Engineering-Team, 10Scap, 06serviceops-deprecated, 07Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#11823203 (10MLechvien-WMF) Incident Follow-up triage here: @... [09:06:03] Emperor: ok, I'll come back later. Good hunting, and sorry for the distraction! 
[09:06:42] (03CR) 10CI reject: [V:04-1] cache::haproxy: small fix in contact info regex [puppet] - 10https://gerrit.wikimedia.org/r/1271593 (owner: 10Fabfur) [09:07:09] (03CR) 10CI reject: [V:04-1] Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:07:36] (03PS2) 10Fabfur: cache::haproxy: small fix in contact info regex [puppet] - 10https://gerrit.wikimedia.org/r/1271593 [09:07:44] (03CR) 10Blake: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269382 (owner: 10Blake) [09:08:25] 10ops-eqiad, 06DC-Ops: Unresponsive management for clouddb1019.mgmt:22 - https://phabricator.wikimedia.org/T423387 (10phaultfinder) 03NEW [09:09:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90724 and previous config saved to /var/cache/conftool/dbconfig/20260415-090920-fceratto.json [09:09:25] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:09:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [09:09:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T419635)', diff saved to https://phabricator.wikimedia.org/P90725 and previous config saved to /var/cache/conftool/dbconfig/20260415-090945-fceratto.json [09:11:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271593 (owner: 10Fabfur) [09:14:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T419635)', diff saved to https://phabricator.wikimedia.org/P90726 and previous config saved to /var/cache/conftool/dbconfig/20260415-091454-fceratto.json [09:14:59] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 
[09:18:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T410589)', diff saved to https://phabricator.wikimedia.org/P90727 and previous config saved to /var/cache/conftool/dbconfig/20260415-091807-ladsgroup.json [09:18:11] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [09:18:48] FIRING: KubernetesCalicoDown: wikikube-worker2280.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2280.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:19:37] (03PS2) 10Elukey: Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) [09:22:54] mvernon@cumin2002 convert-disks (PID 3988567) is awaiting input [09:25:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P90728 and previous config saved to /var/cache/conftool/dbconfig/20260415-092502-fceratto.json [09:26:12] (03PS3) 10Elukey: Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) [09:28:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P90729 and previous config saved to /var/cache/conftool/dbconfig/20260415-092815-ladsgroup.json [09:30:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to 
https://phabricator.wikimedia.org/P90730 and previous config saved to /var/cache/conftool/dbconfig/20260415-093511-fceratto.json [09:35:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS bullseye [09:38:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P90731 and previous config saved to /var/cache/conftool/dbconfig/20260415-093823-ladsgroup.json [09:41:51] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:42:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:43:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:44:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:45:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T419635)', diff saved to https://phabricator.wikimedia.org/P90732 and previous config saved to /var/cache/conftool/dbconfig/20260415-094519-fceratto.json [09:45:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:45:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:45:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T419635)', diff saved to https://phabricator.wikimedia.org/P90733 and previous config saved to /var/cache/conftool/dbconfig/20260415-094544-fceratto.json 
[09:45:52] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:48:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T410589)', diff saved to https://phabricator.wikimedia.org/P90734 and previous config saved to /var/cache/conftool/dbconfig/20260415-094831-ladsgroup.json [09:48:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [09:48:48] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [09:48:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T419635)', diff saved to https://phabricator.wikimedia.org/P90735 and previous config saved to /var/cache/conftool/dbconfig/20260415-094852-fceratto.json [09:49:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1236 (T410589)', diff saved to https://phabricator.wikimedia.org/P90736 and previous config saved to /var/cache/conftool/dbconfig/20260415-094902-ladsgroup.json [09:50:23] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:50:48] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:51:13] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:52:52] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:53:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:53:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [09:53:58] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2280.codfw.wmnet [09:54:03] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271613 [09:54:52] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:55:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:56:34] !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker2280.codfw.wmnet [09:58:08] !log jayme@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2280.codfw.wmnet with reason: hardware issues [09:58:42] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271613 (owner: 10Elukey) [09:59:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P90737 and previous config saved to /var/cache/conftool/dbconfig/20260415-095901-fceratto.json [09:59:03] RECOVERY - SSH on wikikube-worker2280 is OK: SSH OK - OpenSSH_9.2p1 
Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:59:05] RECOVERY - Host wikikube-worker2280 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [09:59:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1000) [10:00:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:01:38] (03PS1) 10Elukey: Upstream release v12.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1271620 [10:01:46] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1271620 (owner: 10Elukey) [10:02:19] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:03:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:03:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:08:13] !log uploaded spicerack_12.4.0 to apt.wikimedia.org bookworm-wikimedia [10:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:19] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:09:09] 
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P90738 and previous config saved to /var/cache/conftool/dbconfig/20260415-100908-fceratto.json [10:09:10] (03PS2) 10Brouberol: Refactor airflow monitor unit tests [alerts] - 10https://gerrit.wikimedia.org/r/1271612 (https://phabricator.wikimedia.org/T411405) [10:10:12] !log upgrade spicerack on cumin nodes [10:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:26] !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2280.codfw.wmnet [10:10:28] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2280.codfw.wmnet [10:10:41] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2280.codfw.wmnet [10:10:46] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2280.codfw.wmnet [10:19:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T419635)', diff saved to https://phabricator.wikimedia.org/P90739 and previous config saved to /var/cache/conftool/dbconfig/20260415-101917-fceratto.json [10:19:21] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:19:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:19:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1193 (T419635)', diff saved to https://phabricator.wikimedia.org/P90740 and previous config saved to /var/cache/conftool/dbconfig/20260415-101942-fceratto.json [10:22:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T419635)', diff saved to https://phabricator.wikimedia.org/P90741 and previous config saved to 
/var/cache/conftool/dbconfig/20260415-102250-fceratto.json [10:23:42] (03CR) 10Clément Goubert: [C:03+1] "Since we're evaluating rate limits on windows that are not hourly (minutes), Daniel removed the hardcoding of the rate limit UNIT in `pref" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270976 (owner: 10Daniel Kinzler) [10:24:41] (03PS1) 10Elukey: ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) [10:29:24] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2069.codfw.wmnet with OS bullseye [10:29:24] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2069 [10:31:01] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11823622 (10MatthewVernon) Update - today I performed the convert-disks and reimage process on ms-be2069 but without a firmware upgrade, i.e. ` sudo cookbook sre... [10:32:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P90742 and previous config saved to /var/cache/conftool/dbconfig/20260415-103258-fceratto.json [10:33:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. boot-complete.target would be an alternative, but systemd-udev-settle.service should also work fine." 
[debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1270927 (owner: 10Elukey) [10:36:02] (03CR) 10Elukey: [V:03+2 C:03+2] Update the systemd units to wait for udev before starting [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1270927 (owner: 10Elukey) [10:37:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 365 days, 0:00:00 on dborch1001.wikimedia.org with reason: T416582 [10:37:22] T416582: Migrate orchestrator to Trixie - https://phabricator.wikimedia.org/T416582 [10:37:41] (03CR) 10Vgutierrez: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1192934 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [10:39:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS trixie [10:39:44] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11823704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS trixie [10:39:47] (03CR) 10Vgutierrez: [C:03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [10:42:30] (03CR) 10Majavah: [C:03+2] wikimedia.org: Restore original TTL for dumps [dns] - 10https://gerrit.wikimedia.org/r/1270363 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [10:42:35] !log taavi@dns1004 START - running authdns-update [10:43:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P90743 and previous config saved to /var/cache/conftool/dbconfig/20260415-104306-fceratto.json [10:44:10] !log taavi@dns1004 END - running authdns-update [10:45:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [10:45:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T419961)', diff saved to https://phabricator.wikimedia.org/P90744 and previous config saved to /var/cache/conftool/dbconfig/20260415-104535-fceratto.json [10:53:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T419635)', diff saved to https://phabricator.wikimedia.org/P90745 and previous config saved to /var/cache/conftool/dbconfig/20260415-105314-fceratto.json [10:53:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:53:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [10:53:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90746 and previous config saved to /var/cache/conftool/dbconfig/20260415-105338-fceratto.json [10:53:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T419961)', diff saved to https://phabricator.wikimedia.org/P90747 and previous config saved 
to /var/cache/conftool/dbconfig/20260415-105349-fceratto.json [10:58:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90748 and previous config saved to /var/cache/conftool/dbconfig/20260415-105848-fceratto.json [10:58:53] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:59:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [11:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1100). [11:01:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:03:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P90749 and previous config saved to /var/cache/conftool/dbconfig/20260415-110357-fceratto.json [11:05:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11823753 (10MoritzMuehlenhoff) I tried to reproduce these errors with two hosts which formerly had failing Puppet runs (by explicitly running them against depoole... 
[11:05:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [11:06:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:06:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:08:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Passimacopoulos - https://phabricator.wikimedia.org/T423301#11823771 (10Krinkle) [11:08:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P90750 and previous config saved to /var/cache/conftool/dbconfig/20260415-110856-fceratto.json [11:11:23] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:12:07] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[11:14:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:14:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P90751 and previous config saved to /var/cache/conftool/dbconfig/20260415-111405-fceratto.json [11:14:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:15:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:16:51] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [11:17:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:17:21] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:18:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[11:19:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P90752 and previous config saved to /var/cache/conftool/dbconfig/20260415-111905-fceratto.json [11:20:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:20:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:20:21] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:21:16] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:22:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:24:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T419961)', diff saved to https://phabricator.wikimedia.org/P90753 and previous config saved to /var/cache/conftool/dbconfig/20260415-112413-fceratto.json [11:24:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [11:24:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T419961)', diff saved to https://phabricator.wikimedia.org/P90754 and previous config saved to /var/cache/conftool/dbconfig/20260415-112445-fceratto.json [11:25:25] (03PS5) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [11:27:43] !log 
dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:28:14] (03CR) 10CI reject: [V:04-1] airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [11:29:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T419635)', diff saved to https://phabricator.wikimedia.org/P90755 and previous config saved to /var/cache/conftool/dbconfig/20260415-112913-fceratto.json [11:29:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:29:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [11:29:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1214 (T419635)', diff saved to https://phabricator.wikimedia.org/P90756 and previous config saved to /var/cache/conftool/dbconfig/20260415-112937-fceratto.json [11:30:09] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:30:26] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:30:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T419961)', diff saved to https://phabricator.wikimedia.org/P90757 and previous config saved to /var/cache/conftool/dbconfig/20260415-113053-fceratto.json [11:31:21] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:32:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T419635)', diff saved to https://phabricator.wikimedia.org/P90758 and previous config saved to /var/cache/conftool/dbconfig/20260415-113241-fceratto.json [11:37:27] (03PS6) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [11:37:57] (03CR) 10Bodhisattwa: "thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [11:38:05] (03PS6) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [11:38:05] (03PS5) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [11:39:56] (03PS6) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [11:40:09] (03CR) 10CI reject: [V:04-1] airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [11:40:24] (03PS7) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [11:40:33] (03PS7) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [11:41:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P90761 and previous config saved to 
/var/cache/conftool/dbconfig/20260415-114101-fceratto.json [11:41:22] (03PS1) 10MVernon: hiera: ms-be206[8-9] need new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1271676 (https://phabricator.wikimedia.org/T354872) [11:42:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P90762 and previous config saved to /var/cache/conftool/dbconfig/20260415-114249-fceratto.json [11:46:29] (03CR) 10Jcrespo: [C:03+1] "Regex and commit text agree" [puppet] - 10https://gerrit.wikimedia.org/r/1271676 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [11:48:14] (03CR) 10MVernon: [C:03+2] hiera: ms-be206[8-9] need new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1271676 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [11:48:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P90764 and previous config saved to /var/cache/conftool/dbconfig/20260415-115109-fceratto.json [11:52:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P90765 and previous config saved to /var/cache/conftool/dbconfig/20260415-115257-fceratto.json [11:55:25] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:55:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:58:22] (03CR) 10Atsuko: [C:03+1] Prepare dse-k8s-ctrl servers for ipip migration [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:00:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS trixie [12:00:56] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11823886 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS tr... [12:01:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T419961)', diff saved to https://phabricator.wikimedia.org/P90766 and previous config saved to /var/cache/conftool/dbconfig/20260415-120117-fceratto.json [12:01:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [12:01:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T419961)', diff saved to https://phabricator.wikimedia.org/P90767 and previous config saved to /var/cache/conftool/dbconfig/20260415-120138-fceratto.json [12:01:55] (03PS2) 10Kamila Součková: Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) [12:02:44] (03CR) 10Daniel Kinzler: [C:03+2] "Yea, it's unfortunate... 
without the time bucket flag, we may end up over-counting hourly limits (because we may see the value from the cu" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270976 (owner: 10Daniel Kinzler) [12:03:03] (03CR) 10Kamila Součková: Revert "shellbox: Setup shellbox-icu72" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [12:03:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T419635)', diff saved to https://phabricator.wikimedia.org/P90768 and previous config saved to /var/cache/conftool/dbconfig/20260415-120305-fceratto.json [12:03:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:03:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [12:03:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1226 (T419635)', diff saved to https://phabricator.wikimedia.org/P90769 and previous config saved to /var/cache/conftool/dbconfig/20260415-120331-fceratto.json [12:03:45] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11823893 (10MatthewVernon) Fixing the hiera issue (which was causing puppet to try and set up old-style storage still) made the trixie reim... [12:04:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS bullseye [12:04:22] (03CR) 10Atsuko: [C:03+1] "Does it requires `include profile::lvs::realserver::ipip` in `./modules/role/manifests/dse_k8s/master.pp` as per documentation?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:04:28] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11823899 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with O... [12:04:38] (03PS1) 10KartikMistry: Update cxserver to 2026-04-14-071531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271686 [12:05:10] (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1271688 [12:05:58] (03CR) 10Atsuko: [C:03+1] "mark as unresolved" [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:06:39] Doing cxserver deployment; minor changes. [12:06:57] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-04-14-071531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271686 (owner: 10KartikMistry) [12:07:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T419635)', diff saved to https://phabricator.wikimedia.org/P90770 and previous config saved to /var/cache/conftool/dbconfig/20260415-120739-fceratto.json [12:08:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T419961)', diff saved to https://phabricator.wikimedia.org/P90771 and previous config saved to /var/cache/conftool/dbconfig/20260415-120851-fceratto.json [12:08:58] (03Merged) 10jenkins-bot: Update cxserver to 2026-04-14-071531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271686 (owner: 10KartikMistry) [12:10:09] (03PS7) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 
(https://phabricator.wikimedia.org/T405509) [12:11:29] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:11:53] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:12:49] (03CR) 10CI reject: [V:04-1] airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [12:14:12] (03CR) 10Mszwarc: [C:03+1] Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [12:15:16] (03PS8) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [12:17:07] (03CR) 10CI reject: [V:04-1] airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [12:17:27] (03PS2) 10Btullis: Prepare dse-k8s-ctrl servers for ipip migration [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) [12:17:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P90772 and previous config saved to /var/cache/conftool/dbconfig/20260415-121748-fceratto.json [12:18:50] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:19:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P90773 and previous config saved to /var/cache/conftool/dbconfig/20260415-121859-fceratto.json [12:20:14] (03PS1) 10Vgutierrez: 
cache::contact_info: Ignore invalid patterns containing an @ [puppet] - 10https://gerrit.wikimedia.org/r/1271695 [12:20:54] (03PS2) 10Vgutierrez: cache::contact_info: Ignore invalid patterns containing an @ [puppet] - 10https://gerrit.wikimedia.org/r/1271695 [12:21:11] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations: debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413 (10Clement_Goubert) 03NEW [12:21:41] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:21:56] (03CR) 10Atsuko: [C:03+1] Prepare dse-k8s-ctrl servers for ipip migration [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:22:03] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1271688 (owner: 10Muehlenhoff) [12:22:12] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:22:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [12:22:36] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:22:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:23:12] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:23:48] jouncebot: nowandnext [12:23:48] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [12:23:48] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1300) [12:24:04] (03PS1) 10Dreamy Jazz: VisualEditor hCaptcha: Clear challenge container for new render [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271697 (https://phabricator.wikimedia.org/T423294) [12:24:22] (03CR) 10Btullis: "Thank you. 
Yes, that makes much more sense now and the PCC output shows many more changes, which is good." [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:24:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271697 (https://phabricator.wikimedia.org/T423294) (owner: 10Dreamy Jazz) [12:25:09] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:25:31] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.update-replication (exit_code=99) [12:25:38] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:25:43] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:25:46] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:25:54] (03PS9) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [12:26:14] !log Updated cxserver to 2026-04-14-071531-production [12:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:21] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:27:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:27:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P90774 and previous config saved to /var/cache/conftool/dbconfig/20260415-122756-fceratto.json [12:28:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [12:28:57] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations: debmonitor-client crashes for growthbook image - 
https://phabricator.wikimedia.org/T423413#11824005 (10brouberol) So, one peculiarity with the growthbook image is that the software depends on both node24 and python3.11. The first is bundle in tr... [12:29:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P90775 and previous config saved to /var/cache/conftool/dbconfig/20260415-122907-fceratto.json [12:29:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:30:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:30:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:30:18] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:31:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:31:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:32:02] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:32:10] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:32:24] (03CR) 10KartikMistry: [C:03+1] Register ArticleGuidance extension and enable in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295) (owner: 10Sbisson) [12:32:38] Hi, I have a config patch (^) to deploy soon to enable a new extension in beta, wondering if it's done exactly the same way as production in term of the WikimediaDebug browser extension. [12:33:14] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations: debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11824026 (10brouberol) One thing we could do here is unset `PYTHONPATH` while running the command? ` # PYTHONPATH= debmonitor-client -n -i blah > blah.tmp... 
[12:33:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11824027 (10Jgreen) 05Open→03Resolved p:05Triage→03Medium All set! [12:33:33] (03PS1) 10Klausman: manifests: Enable iommu=pt kernel parameter for MI300 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1271699 (https://phabricator.wikimedia.org/T421461) [12:33:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:34:03] (03Merged) 10jenkins-bot: VisualEditor hCaptcha: Clear challenge container for new render [extensions/ConfirmEdit] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271697 (https://phabricator.wikimedia.org/T423294) (owner: 10Dreamy Jazz) [12:34:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:34:16] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:34:25] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:34:52] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1271697|VisualEditor hCaptcha: Clear challenge container for new render (T423294)]] [12:34:55] T423294: VisualEditor hCaptcha: Visual challenge cut off on mobile devices - https://phabricator.wikimedia.org/T423294 [12:34:56] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:36:17] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:36:28] stephanebisson: Deployments to the beta cluster are applied without scap [12:36:32] (03PS10) 10Btullis: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) [12:36:47] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1271697|VisualEditor hCaptcha: Clear challenge container for new render (T423294)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:36:52] So there isn't a test stage and about 10 mins later the changes are applied to the beta wikis [12:37:08] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:37:16] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:37:22] You would still need to use scap for this change, so I guess you could test in prod that the extension remains uninstalled on prod? [12:37:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:37:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:38:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T419635)', diff saved to https://phabricator.wikimedia.org/P90776 and previous config saved to /var/cache/conftool/dbconfig/20260415-123803-fceratto.json [12:38:08] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:38:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T410589)', diff saved to https://phabricator.wikimedia.org/P90777 and previous config saved to /var/cache/conftool/dbconfig/20260415-123811-ladsgroup.json [12:38:15] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:38:44] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:38:56] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:39:14] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:39:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T419961)', diff saved to https://phabricator.wikimedia.org/P90778 and previous config saved to /var/cache/conftool/dbconfig/20260415-123915-fceratto.json [12:39:21] (03CR) 10JMeybohm: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:39:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [12:39:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T419961)', diff saved to https://phabricator.wikimedia.org/P90779 and previous config saved to /var/cache/conftool/dbconfig/20260415-123937-fceratto.json [12:40:37] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:40:44] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:40:59] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:41:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:41:29] Dreamy_Jazz: if I understand: we still need to sync the config repo in production with scap so it's not behind but it gets applied to beta as soon as it is merged, much like the code from MW core and extensions. Am I on the right track? 
[12:41:30] (03CR) 10Btullis: [C:03+2] Prepare dse-k8s-ctrl servers for ipip migration [puppet] - 10https://gerrit.wikimedia.org/r/1270929 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [12:41:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:41:49] Yes, you are correct [12:41:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:42:34] (The job to apply updates occurs every 10 minutes AFAIK so it may not be immediately applied to beta wikis) [12:42:42] (03CR) 10Dpogorzelski: [C:03+1] istio: revisit Prometheus buckets for Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [12:43:03] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271697|VisualEditor hCaptcha: Clear challenge container for new render (T423294)]] (duration: 08m 11s) [12:43:07] T423294: VisualEditor hCaptcha: Visual challenge cut off on mobile devices - https://phabricator.wikimedia.org/T423294 [12:43:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:43:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:44:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:44:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:44:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:44:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:45:01] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:45:11] (03CR) 10Fabfur: [C:03+1] "Ok for me, see also I6591749e029a4473c48b52d0fec28d3806edeb04 if you want to include the very small "fix" in your patch" [puppet] - 10https://gerrit.wikimedia.org/r/1271695 (owner: 10Vgutierrez) [12:45:12] !log fceratto@cumin1003 END (PASS) - Cookbook 
sre.mysql.update-replication (exit_code=0) [12:45:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:45:27] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:45:54] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [12:46:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [12:46:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T419961)', diff saved to https://phabricator.wikimedia.org/P90780 and previous config saved to /var/cache/conftool/dbconfig/20260415-124633-fceratto.json [12:47:10] (03CR) 10Vgutierrez: [C:03+2] "that's already addressed in this CR by escaping `-`" [puppet] - 10https://gerrit.wikimedia.org/r/1271695 (owner: 10Vgutierrez) [12:48:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS bullseye [12:48:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P90781 and previous config saved to /var/cache/conftool/dbconfig/20260415-124819-ladsgroup.json [12:48:30] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11824085 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye completed: - ms... [12:52:12] (03CR) 10Blake: [C:03+2] service: exclude apus from the switchover. 
[puppet] - 10https://gerrit.wikimedia.org/r/1269382 (owner: 10Blake) [12:52:32] (03PS1) 10Dpogorzelski: knative-serving: [DRAFT] update chart to 1.21.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271709 [12:53:20] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271448 (owner: 10PipelineBot) [12:54:36] Dreamy_Jazz so we can sync, check for side effects in production, then wait up to 10 minutes for the config change to take effect in beta? [12:54:45] Yeah [12:56:23] (03CR) 10Muehlenhoff: [C:03+2] Make cn=growthbook-customelevatedaccess managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1271488 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [12:56:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P90782 and previous config saved to /var/cache/conftool/dbconfig/20260415-125640-fceratto.json [12:58:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P90783 and previous config saved to /var/cache/conftool/dbconfig/20260415-125828-ladsgroup.json [13:00:05] Urbanecm and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1300). nyaa~ [13:00:05] stephanebisson and codenamenoreste: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye [13:00:12] o/ [13:00:23] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11824099 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye [13:00:24] kart_ are you ready? [13:01:06] Yes. Let's start. [13:01:14] I can deploy stephanebisson's change. [13:02:05] Go for it [13:02:29] (03PS2) 10Robertsky: siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) [13:03:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295) (owner: 10Sbisson) [13:03:47] (03CR) 10Robertsky: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [13:04:10] (03Merged) 10jenkins-bot: Register ArticleGuidance extension and enable in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270991 (https://phabricator.wikimedia.org/T423295) (owner: 10Sbisson) [13:04:36] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1270991|Register ArticleGuidance extension and enable in labs (T423295)]] [13:04:40] T423295: Deploy Article Guidance extension to the beta cluster - https://phabricator.wikimedia.org/T423295 [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:27] !log kartik@deploy1003 sbisson, kartik: Backport for 
[[gerrit:1270991|Register ArticleGuidance extension and enable in labs (T423295)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P90784 and previous config saved to /var/cache/conftool/dbconfig/20260415-130649-fceratto.json [13:07:07] stephanebisson: we can test, if we can! [13:07:16] On it [13:07:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:07:30] !log fceratto@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:08:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:08:24] !log jmm@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-codfw [13:08:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T410589)', diff saved to https://phabricator.wikimedia.org/P90785 and previous config saved to /var/cache/conftool/dbconfig/20260415-130836-ladsgroup.json [13:08:41] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:08:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2221.codfw.wmnet with reason: Maintenance [13:08:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T410589)', diff saved to https://phabricator.wikimedia.org/P90786 and previous config saved to /var/cache/conftool/dbconfig/20260415-130849-ladsgroup.json [13:08:58] kart_ I think I can confirm that there is no side effect [13:09:43] Special:Version won't add AG yet. 
[13:10:16] kart_ config will take up to 10 minutes to be applied in beta [13:10:55] oh right. [13:10:59] I'll go ahead then. [13:12:21] stephanebisson: ^ Is that fine? [13:12:35] kart_, yes go ahead [13:12:48] !log kartik@deploy1003 sbisson, kartik: Continuing with sync [13:13:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [13:15:58] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11824193 (10Blake) @Scott_French Does something like https://wikitech.wikimedia.org/wiki/Ceph/Cephadm#Pooli... [13:16:25] (03Abandoned) 10Jgiannelos: rest-gateway: Cleanup PCS endpoint definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138341 (https://phabricator.wikimedia.org/T385033) (owner: 10Jgiannelos) [13:16:38] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270991|Register ArticleGuidance extension and enable in labs (T423295)]] (duration: 12m 02s) [13:16:41] PROBLEM - Host wikikube-worker2280 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:42] T423295: Deploy Article Guidance extension to the beta cluster - https://phabricator.wikimedia.org/T423295 [13:16:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T419961)', diff saved to https://phabricator.wikimedia.org/P90787 and previous config saved to /var/cache/conftool/dbconfig/20260415-131657-fceratto.json [13:17:15] stephanebisson: done [13:17:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [13:17:44] kart_ thanks! 
The config should be applied anytime now... [13:17:46] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:17:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11824216 (10ssingh) Hi @VRiley-WMF: any updates from Dell's side? Thanks! [13:18:03] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:19:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [13:21:33] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1026.eqiad.wmnet [13:21:34] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1026.eqiad.wmnet [13:23:31] stephanebisson: seems available now.. [13:24:30] (03PS1) 10Muehlenhoff: Make cn=growthbook-readonly managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1271717 (https://phabricator.wikimedia.org/T420688) [13:24:57] (03PS2) 10Muehlenhoff: Make cn=growthbook-readonly managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1271717 (https://phabricator.wikimedia.org/T420688) [13:25:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [13:25:28] (03CR) 10Jelto: "looks mostly good, one nit in-line and I have the same question about the chown. What's the idea of the additional chown?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1270863 (https://phabricator.wikimedia.org/T333143) (owner: 10Arnaudb) [13:27:08] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271718 [13:27:22] (03CR) 10Klausman: [C:03+1] rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [13:27:49] (03CR) 10Klausman: [C:03+1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [13:28:28] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2006.codfw.wmnet [13:29:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2006.codfw.wmnet [13:30:16] (03CR) 10Clément Goubert: [C:03+1] Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [13:34:16] (03PS1) 10Filippo Giunchedi: openstack: set oslo.messaging processname in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/1271719 (https://phabricator.wikimedia.org/T423378) [13:34:32] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2006.codfw.wmnet [13:34:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2006.codfw.wmnet [13:34:40] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2007.codfw.wmnet [13:38:27] (03PS8) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [13:38:47] andre I am requesting to deploy two patches - both are 
for lbwiki [13:39:20] wait, please disregard I pinged the wrong user... :( [13:39:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2007.codfw.wmnet [13:40:28] wait, let me ping TheresNoTime [13:41:15] (also cc urbanecm as a listed deployer for this window) [13:42:21] (03CR) 10Brouberol: Make cn=growthbook-readonly managed in Bitu (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271717 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [13:42:32] codenamenoreste: give me a moment - which patches? [13:42:41] jouncebot: nowandnext [13:42:41] For the next 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1300) [13:42:41] In 0 hour(s) and 17 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1400) [13:43:13] patches 1271215 and 1271206 (cc TheresNoTime) [13:44:22] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for clouddb1019.mgmt:22 - https://phabricator.wikimedia.org/T423387#11824351 (10Jclark-ctr) a:03Jclark-ctr [13:44:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS bullseye [13:44:29] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11824353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye completed: - ms... 
[13:44:52] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2007.codfw.wmnet [13:44:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2007.codfw.wmnet [13:45:00] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2008.codfw.wmnet [13:45:07] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for clouddb1019.mgmt:22 - https://phabricator.wikimedia.org/T423387#11824355 (10Jclark-ctr) this server is failed and in process of decom T423151 [13:45:26] (03CR) 10AOkoth: [C:03+1] miscweb: add config environment variables to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271580 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [13:45:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2008.codfw.wmnet [13:46:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [13:47:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T419961)', diff saved to https://phabricator.wikimedia.org/P90788 and previous config saved to /var/cache/conftool/dbconfig/20260415-134704-fceratto.json [13:47:11] (03CR) 10Brouberol: [C:03+1] "Thank you, this is beautifully done!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [13:47:25] (03PS2) 10Codename Noreste: lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271206 (https://phabricator.wikimedia.org/T423102) [13:48:32] (03CR) 10Samtar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271206 (https://phabricator.wikimedia.org/T423102) (owner: 10Codename Noreste) [13:48:52] codenamenoreste: (still looking, can probably deploy these though we will run over the window) [13:49:15] We have 11 mins left (based on where I live) [13:50:03] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286#11824367 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Right, so the problem was that I'd missed one place of setting the new disk layout optio... [13:50:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271206 (https://phabricator.wikimedia.org/T423102) (owner: 10Codename Noreste) [13:50:45] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2008.codfw.wmnet [13:50:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2008.codfw.wmnet [13:50:52] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2009.codfw.wmnet [13:51:22] (03CR) 10Brouberol: [C:03+1] Make cn=growthbook-readonly managed in Bitu (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271717 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [13:51:25] (03Merged) 10jenkins-bot: lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271206 
(https://phabricator.wikimedia.org/T423102) (owner: 10Codename Noreste) [13:51:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2009.codfw.wmnet [13:51:48] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1271206|lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount (T423102)]] [13:51:52] T423102: [lbwiki] Set wgAutoConfirmCount to 10 - https://phabricator.wikimedia.org/T423102 [13:52:48] (03PS2) 10Codename Noreste: lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) [13:53:06] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11824381 (10MatthewVernon) [13:53:13] (03CR) 10Samtar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste) [13:53:37] (03CR) 10JHathaway: [C:03+2] sysctls: add optional module param to sysctl::parameters [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [13:53:39] !log samtar@deploy1003 samtar, codenamenoreste: Backport for [[gerrit:1271206|lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount (T423102)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:54:22] codenamenoreste: any testing you want to do for ^ ? 
[13:55:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T419961)', diff saved to https://phabricator.wikimedia.org/P90789 and previous config saved to /var/cache/conftool/dbconfig/20260415-135519-fceratto.json [13:55:21] I can't verify the edit count for the autoconfirmed user group with a visual cue (aside from my Gerrit change) [13:55:36] !log samtar@deploy1003 samtar, codenamenoreste: Continuing with sync [13:55:50] (03PS3) 10Brouberol: Add "elevated airflow scheduler loop time" mopnitor [alerts] - 10https://gerrit.wikimedia.org/r/1271612 (https://phabricator.wikimedia.org/T411405) [13:56:38] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2009.codfw.wmnet [13:56:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2009.codfw.wmnet [13:56:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-codfw [13:56:52] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11824409 (10isarantopoulos) [13:57:10] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." 
[alerts] - 10https://gerrit.wikimedia.org/r/1271612 (https://phabricator.wikimedia.org/T411405) (owner: 10Brouberol) [13:57:28] (03CR) 10Atsuko: [C:03+1] Add "elevated airflow scheduler loop time" mopnitor [alerts] - 10https://gerrit.wikimedia.org/r/1271612 (https://phabricator.wikimedia.org/T411405) (owner: 10Brouberol) [13:57:43] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [13:57:44] (03CR) 10Brouberol: [C:03+2] Add "elevated airflow scheduler loop time" mopnitor [alerts] - 10https://gerrit.wikimedia.org/r/1271612 (https://phabricator.wikimedia.org/T411405) (owner: 10Brouberol) [13:58:27] (03CR) 10CI reject: [V:04-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [13:59:37] (03CR) 10JHathaway: [C:03+2] acme_chief: delete unused files on passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/1192934 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1400) [14:00:19] * TheresNoTime will overrun by 1 more patch ^ [14:00:58] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271724 (https://phabricator.wikimedia.org/T402956) [14:01:24] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-04-07-234729 to 2026-04-10-185247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271725 (https://phabricator.wikimedia.org/T413729) [14:01:28] (03CR) 10Hnowlan: [C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271718 (owner: 10Muehlenhoff) 
[14:01:29] TheresNoTime: Ack. I'll deploy my service changes. [14:01:36] (03PS1) 10MVernon: swift: restore 3 nodes to rings, drain 2 more for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1271726 (https://phabricator.wikimedia.org/T354872) [14:02:00] (03CR) 10JHathaway: [C:03+2] acme-chief: remove hiera purge guard [puppet] - 10https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [14:02:22] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271724 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester) [14:03:44] TheresNoTime I tested it on my alternate account and there is nothing on the ContentTranslation interface [14:03:54] no article listings on that interface, I meant [14:04:03] (03PS1) 10Jcrespo: dbbackups: Perform a ro backup & start backing up only the latest 2 clusters [puppet] - 10https://gerrit.wikimedia.org/r/1271728 (https://phabricator.wikimedia.org/T421729) [14:04:21] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271724 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester) [14:04:38] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:04:41] 07Puppet, 06Infrastructure-Foundations, 13Patch-For-Review: alert1002.wikimedia.org: Puppet warning of too many entries in /etc/acmecerts/icinga - https://phabricator.wikimedia.org/T401858#11824480 (10jhathaway) 05Open→03Resolved [14:04:42] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:04:46] codenamenoreste: if you're referring to testing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1271215, that's not yet been merged - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1271206 is still 
in the process of being deployed [14:05:11] oh [14:05:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P90790 and previous config saved to /var/cache/conftool/dbconfig/20260415-140527-fceratto.json [14:05:38] (03PS1) 10Kamila Součková: deployment_server: add dse-k8s-codfw to ::general [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) [14:05:55] (03CR) 10Btullis: [C:03+2] airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [14:06:03] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:06:47] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:06:47] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:06:55] in that case, go ahead with deploying the autoconfirmed edit limit patch [14:07:02] (03PS1) 10Jcrespo: dbbackups: Backup only regularly clusters 32 & 33, the read-write ones [puppet] - 10https://gerrit.wikimedia.org/r/1271730 (https://phabricator.wikimedia.org/T421729) [14:07:05] (03CR) 10Kamila Součková: "I believe this is (one of) what the CI is unhappy about in Id0b0492ff8eebac7ff5c486384a394c8d2b85c53 ." 
[puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:07:12] only testing is required for the ContentTranslation limit patch [14:08:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:42] (03Merged) 10jenkins-bot: airflow: Add a geoip-enabled kubernetes executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270925 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [14:08:46] I hit a failure during deployment. [14:09:02] James_F: is https://phabricator.wikimedia.org/P90791 related to a service deployment? "Comparing release=canary, chart=wmf-stable/mediawiki, namespace=mw-api-int mw-api-int, mw-api-int.codfw.canary, Deployment (apps) has changed:" [14:09:20] Huh. Not from me. [14:09:27] ack [14:09:38] I ran into a different issue in staging my k8s service. [14:10:16] codenamenoreste: as of now, 1271206 is not fully deployed (I'll resolve that as and when), and 1271215 may have to wait for another window/later on [14:11:09] TheresNoTime: want me to take a look? [14:11:37] claime: sure, context is in https://phabricator.wikimedia.org/P90791 / https://spiderpig.wikimedia.org/jobs/1758 [14:11:44] (03CR) 10Jcrespo: [C:03+1] swift: restore 3 nodes to rings, drain 2 more for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1271726 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [14:12:37] presumably the image didn't become ready? 
[14:13:11] I saw that I was requested to be added to an allow list on 1271727 [14:13:42] codenamenoreste: yes, just means the CI will run when you upload patches now :) [14:14:39] I'll probably note on T423100 that the patch would probably be delayed until later today or possibly tomorrow [14:14:39] T423100: [lbwiki] Limit ContentTranslation to autoconfirmed and confirmed users - https://phabricator.wikimedia.org/T423100 [14:14:45] (ack) [14:14:46] (03PS9) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [14:15:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P90792 and previous config saved to /var/cache/conftool/dbconfig/20260415-141535-fceratto.json [14:15:55] (03PS1) 10Effie Mouzeli: mcrouter: do not checksum configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) [14:16:41] Raine: claime: Happy to try the deploy again if it was just something transient [14:17:30] (03CR) 10CI reject: [V:04-1] mcrouter: do not checksum configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) (owner: 10Effie Mouzeli) [14:17:40] TheresNoTime: those are typically not transient [14:17:47] Yeah there's something pretty weird [14:17:56] I noted the fact on T423100 about the patch possibly being delayed [14:18:11] ah (: I won't try again [14:18:17] codenamenoreste: (ack) [14:18:18] That's the image it is *supposed* to update to [14:19:07] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade Kafka to version 3.x - https://phabricator.wikimedia.org/T416669#11824563 (10Daimona) Crosslinking T422842 here per @Ottomata: could this have broken kafka in beta? 
[14:19:17] (03CR) 10Jcrespo: "Could you please have a look at this and make sure it matches your understanding:" [puppet] - 10https://gerrit.wikimedia.org/r/1271728 (https://phabricator.wikimedia.org/T421729) (owner: 10Jcrespo) [14:19:58] (03CR) 10MVernon: [C:03+2] swift: restore 3 nodes to rings, drain 2 more for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1271726 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [14:21:55] (03PS2) 10Effie Mouzeli: mcrouter: do not checksum configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) [14:21:57] (03CR) 10Jcrespo: [C:04-2] "Blocked by 1271728 & backup run/archival." [puppet] - 10https://gerrit.wikimedia.org/r/1271730 (https://phabricator.wikimedia.org/T421729) (owner: 10Jcrespo) [14:22:08] Oh looks like the pods didn't start, wth [14:25:36] (03PS1) 10Btullis: Switch the dse-k8s-ctrl service from Weighted Round Robin to Maglev [puppet] - 10https://gerrit.wikimedia.org/r/1271745 (https://phabricator.wikimedia.org/T420437) [14:25:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T419961)', diff saved to https://phabricator.wikimedia.org/P90794 and previous config saved to /var/cache/conftool/dbconfig/20260415-142543-fceratto.json [14:26:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:26:07] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11824631 (10MatthewVernon) [14:26:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T419961)', diff saved to https://phabricator.wikimedia.org/P90795 and previous config saved to /var/cache/conftool/dbconfig/20260415-142615-fceratto.json [14:26:16] TheresNoTime: Can you revert that change please so everything is back in 
an homogeneous state? [14:26:27] claime: ack [14:26:35] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271745 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [14:26:59] No pod managed to start on that release for some reason it's really strange [14:27:01] (03PS1) 10Samtar: Revert "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271748 [14:27:05] Because mw-debug pods started fine [14:27:05] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations: debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11824649 (10brouberol) [14:29:20] wait, was 1271748 suddenly reverted because of an error? [14:29:23] (03CR) 10Samtar: [C:03+2] "reverting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271748 (owner: 10Samtar) [14:29:45] codenamenoreste: yes, as it wasn't deployed due to an error [14:29:48] (03PS1) 10Jforrester: Revert "wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271749 [14:29:54] (03CR) 10Jforrester: [C:03+2] Revert "wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271749 (owner: 10Jforrester) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1430) [14:30:07] It was deployed only to debug servers then failed when hitting prod [14:30:17] I'm still trying to figure out why [14:30:19] I'll mark the task of that patch as stalled... 
[14:30:29] Don't think it's anything to do with the patch itself [14:30:43] (03Merged) 10jenkins-bot: Revert "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271748 (owner: 10Samtar) [14:31:32] I meant that was because the patch was reverted due to an error, so I will have to mark its relevant task as stalled [14:32:01] (03Merged) 10jenkins-bot: Revert "wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271749 (owner: 10Jforrester) [14:32:05] (03PS2) 10Kamila Součková: deployment_server: add dse-k8s-codfw to ::general [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) [14:32:12] (03CR) 10CI reject: [V:04-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:32:31] FIRING: Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:32:36] (03CR) 10CI reject: [V:04-1] deployment_server: add dse-k8s-codfw to ::general [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:32:53] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271750 [14:33:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T419961)', diff saved to https://phabricator.wikimedia.org/P90796 and previous config saved to /var/cache/conftool/dbconfig/20260415-143319-fceratto.json [14:33:22] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271750 (owner: 10PipelineBot) [14:33:37] Ok I think I found the problem, there's a hung pod in that 
release [14:34:57] I'll manually destroy it, then you should deploy the revert, and we can try to re-deploy the patch again [14:35:10] (03PS3) 10Kamila Součková: deployment_server: add dse-k8s-codfw to ::general [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) [14:35:21] claime: sounds good [14:35:29] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271750 (owner: 10PipelineBot) [14:36:26] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:36:28] I'm going to have to be very impolite with it, it doesn't want to stop at all [14:36:43] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:37:10] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:37:24] jayme: it's because of wikikube-worker2280.codfw.wmnet ... 
[14:37:31] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:37:33] it's hung again [14:38:09] uhf...that was fast [14:38:19] depooling [14:39:11] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2280.codfw.wmnet [14:40:30] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:40:34] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:41:50] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker2280.codfw.wmnet [14:42:00] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:42:06] (03PS1) 10Elukey: Revert "admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271753 [14:42:27] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:42:48] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:43:27] thanks clem [14:43:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P90797 and previous config saved to /var/cache/conftool/dbconfig/20260415-144327-fceratto.json [14:43:33] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:43:52] jayme: it fails because it can't evict the pods, since it's a weird state [14:44:03] I need to force delete all the pods on it [14:45:14] Yeah, this morning the scheduler thought there where running pods because it got marked unreachable [14:45:57] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host 
wikikube-worker2280.codfw.wmnet [14:46:01] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2280.codfw.wmnet [14:46:06] There. [14:46:21] TheresNoTime: you can proceed with the revert and then retry the backprot [14:46:24] backport* [14:46:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [14:46:37] claime: deploying the revert now [14:47:22] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1271748|Revert "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount"]] [14:49:18] !log samtar@deploy1003 samtar: Backport for [[gerrit:1271748|Revert "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:49:41] !log samtar@deploy1003 samtar: Continuing with sync [14:49:53] (03PS3) 10Codename Noreste: lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) [14:50:53] (03CR) 10Kamila Součková: "Also updated the docs: https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Clusters/New&diff=prev&oldid=2402411" [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:51:46] (03CR) 10Samtar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste) [14:51:56] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1273.eqiad.wmnet [14:52:08] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1273.eqiad.wmnet [14:52:31] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:53:19] 06SRE, 06Infrastructure-Foundations, 10netops: No not announce OSPF routes in unicast BGP on Nokia SR-Linux - https://phabricator.wikimedia.org/T423430 (10cmooney) 03NEW p:05Triage→03Low [14:53:26] 06SRE, 06Infrastructure-Foundations, 10netops: No not announce OSPF routes in unicast BGP on Nokia SR-Linux - https://phabricator.wikimedia.org/T423430#11824786 (10cmooney) [14:53:34] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271748|Revert "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount"]] (duration: 06m 12s) [14:53:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P90798 and previous config saved to 
/var/cache/conftool/dbconfig/20260415-145335-fceratto.json [14:54:13] will retry the backport now (by reverting the revert) [14:54:35] (03CR) 10Elukey: [C:03+2] Revert "admin_ng: set cert-manager and cfssl-issuer replicas to 0 in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271753 (owner: 10Elukey) [14:54:58] (03PS1) 10Samtar: Revert^2 "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271760 [14:55:19] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1012.eqiad.wmnet with reason: maintenance [14:55:28] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11824790 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f14ff3b-2530-4ddb-aa83-2093ba30c178) set by jynus@cum... [14:56:14] RECOVERY - Host wikikube-worker2280 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms [14:56:19] (03PS2) 10Filippo Giunchedi: openstack: set oslo.messaging processname in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/1271719 (https://phabricator.wikimedia.org/T423378) [14:56:37] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [14:56:47] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [14:57:09] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:57:25] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:57:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:57:31] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:58:11] (03PS1) 10Ottomata: html enrich staging - remove process_async_enabled_default config until it settles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271761 (https://phabricator.wikimedia.org/T421216) [14:58:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271760 (owner: 10Samtar) [14:58:34] (03CR) 10Scott French: [C:03+1] Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [14:59:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [14:59:11] 06SRE, 06Infrastructure-Foundations, 10netops: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux - https://phabricator.wikimedia.org/T423430#11824827 (10cmooney) [14:59:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90799 and previous config saved to /var/cache/conftool/dbconfig/20260415-145918-fceratto.json [14:59:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:59:23] (03CR) 10Andrew Bogott: [C:03+1] openstack: set oslo.messaging processname in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/1271719 (https://phabricator.wikimedia.org/T423378) (owner: 10Filippo Giunchedi) [14:59:35] (03Merged) 10jenkins-bot: Revert^2 "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271760 
(owner: 10Samtar) [14:59:46] (03CR) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:59:59] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1271760|Revert^2 "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount"]] [15:00:07] (03CR) 10Ottomata: [V:03+2 C:03+2] html enrich staging - remove process_async_enabled_default config until it settles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271761 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [15:00:57] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: set oslo.messaging processname in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/1271719 (https://phabricator.wikimedia.org/T423378) (owner: 10Filippo Giunchedi) [15:01:32] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:01:36] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:02:15] hey folks, if there are no objections, i'd like to do a few consecutive backports to reboot the poolcounter servers [15:02:19] !log samtar@deploy1003 samtar: Backport for [[gerrit:1271760|Revert^2 "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:02:34] bjensen: I am just finishing up ^ [15:02:55] TheresNoTime: ah, gotcha! if you don't mind letting me know when you're through, i'd appreciate it :) [15:02:55] !log samtar@deploy1003 samtar: Continuing with sync [15:03:05] bjensen: will do! 
[15:03:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T419961)', diff saved to https://phabricator.wikimedia.org/P90800 and previous config saved to /var/cache/conftool/dbconfig/20260415-150344-fceratto.json [15:03:52] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11824880 (10jcrespo) Update: I just handed the server to @Papaul , backups (and restores) will be unavailable while maintenance is... [15:04:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [15:04:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T419961)', diff saved to https://phabricator.wikimedia.org/P90801 and previous config saved to /var/cache/conftool/dbconfig/20260415-150415-fceratto.json [15:04:46] codenamenoreste: just for your info, I.. [15:04:49] ah [15:06:54] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271760|Revert^2 "lbwiki: Set minimum requirement of 10 edits for wgAutoConfirmCount"]] (duration: 06m 54s) [15:07:15] claime: yeah that deployment worked now, thank you for your help! [15:07:20] bjensen: all yours! [15:07:26] cheers! 
[15:07:36] (03PS2) 10Blake: ProductionServices: remove poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271742 (https://phabricator.wikimedia.org/T420171) [15:07:49] (03CR) 10Blake: [C:03+2] ProductionServices: remove poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271742 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:08:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:09:01] (03Merged) 10jenkins-bot: ProductionServices: remove poolcounter1006.eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271742 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:11:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T419961)', diff saved to https://phabricator.wikimedia.org/P90802 and previous config saved to /var/cache/conftool/dbconfig/20260415-151114-fceratto.json [15:11:33] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1271742|ProductionServices: remove poolcounter1006.eqiad (T420171)]] [15:12:00] !log installing inetutils security updates [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:25] !log blake@deploy1003 blake: Backport for [[gerrit:1271742|ProductionServices: remove poolcounter1006.eqiad (T420171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:14:44] !log jmm@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-master-codfw [15:14:46] !log blake@deploy1003 blake: Continuing with sync [15:18:32] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271742|ProductionServices: remove poolcounter1006.eqiad (T420171)]] (duration: 06m 59s) [15:18:57] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1006.eqiad.wmnet [15:19:02] !log installing Dovecot security updates on mx-out* [15:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:36] (03CR) 10Bking: [C:03+2] opensearch: strip bundled plugins before WMF pkg [puppet] - 10https://gerrit.wikimedia.org/r/1271473 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [15:19:37] (03PS1) 10Blake: ProductionServices: remove poolcounter1007.eqiad, add 1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271766 (https://phabricator.wikimedia.org/T420171) [15:19:38] (03CR) 10Scott French: [C:03+1] ProductionServices: remove poolcounter1007.eqiad, add 1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271766 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:21:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P90803 and previous config saved to /var/cache/conftool/dbconfig/20260415-152122-fceratto.json [15:22:54] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1006.eqiad.wmnet [15:23:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11825010 (10Jclark-ctr) @klausman Could you put in a silence for it? 
[15:23:09] (03CR) 10Blake: [C:03+2] ProductionServices: remove poolcounter1007.eqiad, add 1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271766 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:24:02] (03Merged) 10jenkins-bot: ProductionServices: remove poolcounter1007.eqiad, add 1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271766 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:24:29] !log update & restart envoy on apus frontends T423065 T382824 [15:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:34] T423065: Stuck-hidden file - https://phabricator.wikimedia.org/T423065 [15:24:34] T382824: Cache problems with new Index pages in Wikisource - https://phabricator.wikimedia.org/T382824 [15:24:49] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1271766|ProductionServices: remove poolcounter1007.eqiad, add 1006 (T420171)]] [15:24:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-master-codfw [15:25:51] (03PS1) 10Effie Mouzeli: mw-debug: use new mcrouter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271771 (https://phabricator.wikimedia.org/T420223) [15:26:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11825025 (10klausman) Done & done. [15:26:31] once more with correct tasks 🤦 [15:26:36] !log update & restart envoy on apus frontends T410975 T419637 [15:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:41] !log blake@deploy1003 blake: Backport for [[gerrit:1271766|ProductionServices: remove poolcounter1007.eqiad, add 1006 (T420171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:26:42] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [15:26:43] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [15:27:15] (03PS1) 10Blake: ProductionServices: re-add poolcounter1007.eqiad. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271772 (https://phabricator.wikimedia.org/T420171) [15:27:23] !log blake@deploy1003 blake: Continuing with sync [15:28:03] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271772 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:30:11] !log update & restart envoy on thanos frontends T410975 T419637 [15:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:43] !log update & restart envoy on ms swift frontends T410975 T419637 [15:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:08] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271766|ProductionServices: remove poolcounter1007.eqiad, add 1006 (T420171)]] (duration: 06m 19s) [15:31:20] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1007.eqiad.wmnet [15:31:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P90804 and previous config saved to /var/cache/conftool/dbconfig/20260415-153130-fceratto.json [15:32:01] (03CR) 10Muehlenhoff: [C:03+2] Make cn=growthbook-readonly managed in Bitu [puppet] - 10https://gerrit.wikimedia.org/r/1271717 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [15:32:51] (03CR) 10Elukey: [C:03+1] mw-debug: use new mcrouter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271771 (https://phabricator.wikimedia.org/T420223) (owner: 10Effie Mouzeli) [15:33:42] (03PS1) 10Muehlenhoff: Add missing record for new group [puppet] - 10https://gerrit.wikimedia.org/r/1271776 
(https://phabricator.wikimedia.org/T420688) [15:34:03] (03PS4) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) [15:34:51] (03CR) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:35:08] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1007.eqiad.wmnet [15:35:19] (03CR) 10Blake: [C:03+2] ProductionServices: re-add poolcounter1007.eqiad. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271772 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:35:51] (03CR) 10Elukey: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271699 (https://phabricator.wikimedia.org/T421461) (owner: 10Klausman) [15:36:32] (03Merged) 10jenkins-bot: ProductionServices: re-add poolcounter1007.eqiad. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271772 (https://phabricator.wikimedia.org/T420171) (owner: 10Blake) [15:37:14] !log blake@deploy1003 Started scap sync-world: Backport for [[gerrit:1271772|ProductionServices: re-add poolcounter1007.eqiad. 
(T420171)]] [15:38:53] (03PS5) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) [15:38:53] (03PS8) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [15:38:53] (03PS8) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [15:38:58] (03CR) 10Elukey: [C:03+1] provision: Workaround Supermicro BIOS to UEFI bug [cookbooks] - 10https://gerrit.wikimedia.org/r/1262196 (https://phabricator.wikimedia.org/T393053) (owner: 10JHathaway) [15:38:59] 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11825168 (10MatthewVernon) [15:39:09] !log blake@deploy1003 blake: Backport for [[gerrit:1271772|ProductionServices: re-add poolcounter1007.eqiad. (T420171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:39:40] !log blake@deploy1003 blake: Continuing with sync [15:41:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T419961)', diff saved to https://phabricator.wikimedia.org/P90805 and previous config saved to /var/cache/conftool/dbconfig/20260415-154138-fceratto.json [15:42:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [15:42:03] (03PS3) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [15:42:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2192 (T419961)', diff saved to https://phabricator.wikimedia.org/P90806 and previous config saved to /var/cache/conftool/dbconfig/20260415-154210-fceratto.json [15:42:23] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11825211 (10Papaul) Please see below for the steps on how to use sum to update chassis, board and product information on Super... [15:42:56] (03CR) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [15:43:23] !log blake@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271772|ProductionServices: re-add poolcounter1007.eqiad. 
(T420171)]] (duration: 06m 09s) [15:43:32] done with my backports, thanks all [15:43:52] (03PS4) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [15:44:02] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11825231 (10Papaul) @jcrespo I am done all yours. This server is power on. Thank you [15:45:21] 07Puppet, 06cloud-services-team, 10Cloud-VPS: Repeated Puppet failures for PetScan - https://phabricator.wikimedia.org/T366141#11825251 (10A_smart_kitten) (belatedly triaging into the #cloud-vps project, but @magnus I assume that it might be good if you could confirm whether or not this remains an issue `:)`) [15:48:25] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11825270 (10Papaul) @Jclark-ctr Netbox is no longer reporting errors on this server , once @jcrespo done putting the server back in... 
[15:49:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T419961)', diff saved to https://phabricator.wikimedia.org/P90807 and previous config saved to /var/cache/conftool/dbconfig/20260415-154911-fceratto.json [15:54:45] (03PS1) 10JHathaway: kdc: ensure net.netfilter.nf_conntrack_max is updated [puppet] - 10https://gerrit.wikimedia.org/r/1271794 (https://phabricator.wikimedia.org/T407726) [15:55:24] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271794 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [15:55:51] 07Puppet: Add PATCH method to Wmflib::HTTP::Method - https://phabricator.wikimedia.org/T392096#11825327 (10A_smart_kitten) @fabfur Just going through some older tasks, just wanted to check if this should still remain open or not (as its patch is merged)? [15:56:33] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for backup1012.eqiad.wmnet [15:56:33] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for backup1012.eqiad.wmnet [15:59:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P90809 and previous config saved to /var/cache/conftool/dbconfig/20260415-155920-fceratto.json [16:02:50] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11825397 (10jcrespo) 05Open→03Resolved New backups already flowing as usual: ` Terminated Jobs: JobId Level Files... [16:03:18] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11825404 (10jcrespo) 05Resolved→03Open All good from my side. 
[16:07:36] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 88685072 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:08:36] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2480992 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P90810 and previous config saved to /var/cache/conftool/dbconfig/20260415-160928-fceratto.json [16:19:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T419961)', diff saved to https://phabricator.wikimedia.org/P90813 and previous config saved to /var/cache/conftool/dbconfig/20260415-161936-fceratto.json [16:19:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [16:21:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11825509 (10VRiley-WMF) I have plugged in 1 QFX-SFP-1GE-T into mr1-eqiad ge-0/0/7 [16:25:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11825529 (10VRiley-WMF) Hey @ssingh thanks for checking. We just got the part in today (it was supposed to be here yesterday). I will be swapping it shortly. I wi... 
[16:25:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [16:25:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T419961)', diff saved to https://phabricator.wikimedia.org/P90814 and previous config saved to /var/cache/conftool/dbconfig/20260415-162513-fceratto.json [16:29:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11825535 (10VRiley-WMF) 05Open→03In progress [16:30:21] (03CR) 10JHathaway: "@mmuhlenhoff@wikimedia.org thoughts on the best way to test this? Can krb2002, *just*, be rebooted?" [puppet] - 10https://gerrit.wikimedia.org/r/1271794 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [16:32:08] (03CR) 10Scott French: "Thanks, Chris! A couple of high-level thoughts." [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:32:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T419961)', diff saved to https://phabricator.wikimedia.org/P90815 and previous config saved to /var/cache/conftool/dbconfig/20260415-163215-fceratto.json [16:33:36] (03PS1) 10Dduvall: jwt_authorizer: Support jwt-authorizer 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1271808 (https://phabricator.wikimedia.org/T346331) [16:34:07] (03CR) 10CI reject: [V:04-1] jwt_authorizer: Support jwt-authorizer 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1271808 (https://phabricator.wikimedia.org/T346331) (owner: 10Dduvall) [16:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:52] (03PS2) 10Btullis: Switch the dse-k8s-ctrl 
service from Weighted Round Robin to Maglev [puppet] - 10https://gerrit.wikimedia.org/r/1271745 (https://phabricator.wikimedia.org/T420437) [16:37:18] (03PS2) 10Dduvall: jwt_authorizer: Support jwt-authorizer 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1271808 (https://phabricator.wikimedia.org/T346331) [16:38:18] (03PS1) 10Andrew Bogott: Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 [16:38:48] (03CR) 10CI reject: [V:04-1] Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [16:40:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [16:41:06] (03PS1) 10Btullis: airflow: Only mount geoip volumes for certain instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) [16:42:20] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11825642 (10Jclark-ctr) 05Open→03Resolved [16:42:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P90816 and previous config saved to /var/cache/conftool/dbconfig/20260415-164223-fceratto.json [16:46:16] !nowandnext [16:46:28] jouncebot: nowandnext [16:46:28] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [16:46:29] In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1700) [16:46:32] * Raine one day... [16:47:06] (03CR) 10Ladsgroup: "My patches should do what you're describing here. 
You can double check it via write time of the ibd files for those tables but from what I" [puppet] - 10https://gerrit.wikimedia.org/r/1271728 (https://phabricator.wikimedia.org/T421729) (owner: 10Jcrespo) [16:47:11] (03CR) 10Ladsgroup: [C:03+1] dbbackups: Perform a ro backup & start backing up only the latest 2 clusters [puppet] - 10https://gerrit.wikimedia.org/r/1271728 (https://phabricator.wikimedia.org/T421729) (owner: 10Jcrespo) [16:48:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11825655 (10Jclark-ctr) 05Open→03Resolved [16:49:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Passimacopoulos - https://phabricator.wikimedia.org/T423301#11825657 (10Passimacopoulos) Thanks Andrea, I can confirm I can access it, and have changed my Kerberos password. Many thanks for your support so... [16:51:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270472 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [16:51:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270470 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [16:51:55] (03PS2) 10Andrew Bogott: Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 [16:51:55] (03CR) 10JHathaway: role::cluster::management: add profile to sync firmwares (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [16:52:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P90817 and previous config saved to 
/var/cache/conftool/dbconfig/20260415-165231-fceratto.json [16:52:36] (03Merged) 10jenkins-bot: Revert "Temporarily add shellbox-icu to $wgShellboxUrls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270472 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [16:52:52] (03Merged) 10jenkins-bot: Revert "Enable $wgTempCategoryCollations for testwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270470 (https://phabricator.wikimedia.org/T422546) (owner: 10Kamila Součková) [16:53:17] !log kamila@deploy1003 Started scap sync-world: Backport for [[gerrit:1270472|Revert "Temporarily add shellbox-icu to $wgShellboxUrls" (T422546)]], [[gerrit:1270470|Revert "Enable $wgTempCategoryCollations for testwiki." (T422546)]] [16:53:21] T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546 [16:54:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11825695 (10VRiley-WMF) Part has been installed and it should be good to go. I checked it in iDRAC and it sees both of the drives. Would you be able to check it o... [16:54:39] (03CR) 10CI reject: [V:04-1] Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [16:55:10] !log kamila@deploy1003 kamila: Backport for [[gerrit:1270472|Revert "Temporarily add shellbox-icu to $wgShellboxUrls" (T422546)]], [[gerrit:1270470|Revert "Enable $wgTempCategoryCollations for testwiki." (T422546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:55:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1700) [17:02:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T419961)', diff saved to https://phabricator.wikimedia.org/P90818 and previous config saved to /var/cache/conftool/dbconfig/20260415-170239-fceratto.json [17:03:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2223.codfw.wmnet with reason: Maintenance [17:03:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T419961)', diff saved to https://phabricator.wikimedia.org/P90819 and previous config saved to /var/cache/conftool/dbconfig/20260415-170310-fceratto.json [17:05:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90820 and previous config saved to /var/cache/conftool/dbconfig/20260415-170501-fceratto.json [17:05:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:05:41] !log kamila@deploy1003 kamila: Continuing with sync [17:06:17] (03PS1) 10Bking: cloudelastic: fix java path typo [puppet] - 10https://gerrit.wikimedia.org/r/1271818 (https://phabricator.wikimedia.org/T422860) [17:08:14] (03PS3) 10Andrew Bogott: Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 [17:08:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [17:09:28] !log kamila@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270472|Revert "Temporarily add shellbox-icu to $wgShellboxUrls" (T422546)]], [[gerrit:1270470|Revert "Enable 
$wgTempCategoryCollations for testwiki." (T422546)]] (duration: 16m 10s) [17:09:32] T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546 [17:10:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T419961)', diff saved to https://phabricator.wikimedia.org/P90821 and previous config saved to /var/cache/conftool/dbconfig/20260415-171011-fceratto.json [17:10:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271818 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:11:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T410589)', diff saved to https://phabricator.wikimedia.org/P90822 and previous config saved to /var/cache/conftool/dbconfig/20260415-171147-ladsgroup.json [17:11:52] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [17:12:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [17:13:12] (03CR) 10Bking: [C:03+2] cloudelastic: fix java path typo [puppet] - 10https://gerrit.wikimedia.org/r/1271818 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [17:14:23] (03PS4) 10Andrew Bogott: Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 [17:14:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [17:15:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P90823 and previous config saved to /var/cache/conftool/dbconfig/20260415-171509-fceratto.json [17:19:19] 07Puppet, 06cloud-services-team, 10Cloud-VPS: Repeated Puppet failures for PetScan - https://phabricator.wikimedia.org/T366141#11826138 (10bd808) 05Open→03Declined petscan4 was replaced by petscan5 in mid-2024. 
If and when something like this happens again the next step in investigating is to ssh int... [17:20:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P90824 and previous config saved to /var/cache/conftool/dbconfig/20260415-172019-fceratto.json [17:20:39] (03PS5) 10Andrew Bogott: Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 [17:20:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [17:21:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P90825 and previous config saved to /var/cache/conftool/dbconfig/20260415-172155-ladsgroup.json [17:24:53] (03PS6) 10Andrew Bogott: Openstack designate: experiment with zookeeper instead of memcached [puppet] - 10https://gerrit.wikimedia.org/r/1271809 [17:25:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P90826 and previous config saved to /var/cache/conftool/dbconfig/20260415-172517-fceratto.json [17:27:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [17:28:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [17:30:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P90827 and previous config saved to /var/cache/conftool/dbconfig/20260415-173027-fceratto.json [17:30:33] (03CR) 10Scott French: [C:03+2] Set initialDelaySeconds on aqs-http-gateway direct Cassandra clients [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: 10Scott 
French) [17:31:40] FYI, I'll be deploying the above ^ to a couple of services in a bit. no mediawiki deployments planned on my end. [17:32:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P90828 and previous config saved to /var/cache/conftool/dbconfig/20260415-173203-ladsgroup.json [17:33:32] (03Merged) 10jenkins-bot: Set initialDelaySeconds on aqs-http-gateway direct Cassandra clients [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270980 (https://phabricator.wikimedia.org/T423168) (owner: 10Scott French) [17:34:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [17:34:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (owner: 10Andrew Bogott) [17:35:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90829 and previous config saved to /var/cache/conftool/dbconfig/20260415-173525-fceratto.json [17:35:30] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:35:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [17:36:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T419635)', diff saved to https://phabricator.wikimedia.org/P90830 and previous config saved to /var/cache/conftool/dbconfig/20260415-173602-fceratto.json [17:36:28] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply [17:36:36] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [17:37:08] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [17:37:16] !log swfrench@deploy1003 helmfile 
[staging] DONE helmfile.d/services/device-analytics: apply [17:37:47] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [17:37:52] (03PS1) 10Mstyles: Rename Test Kitchen Experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271830 (https://phabricator.wikimedia.org/T420007) [17:37:55] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [17:38:04] (03PS1) 10Mstyles: Rename Test Kitchen Experiment [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) [17:38:26] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [17:38:34] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [17:39:06] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply [17:39:16] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [17:39:47] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [17:39:59] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [17:40:30] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [17:40:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T419961)', diff saved to https://phabricator.wikimedia.org/P90831 and previous config saved to /var/cache/conftool/dbconfig/20260415-174035-fceratto.json [17:40:45] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [17:40:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2228.codfw.wmnet with reason: Maintenance [17:41:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T419961)', diff saved to 
https://phabricator.wikimedia.org/P90832 and previous config saved to /var/cache/conftool/dbconfig/20260415-174107-fceratto.json [17:41:47] (03PS1) 10Mstyles: Rename Test Kitchen Experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271832 (https://phabricator.wikimedia.org/T420007) [17:42:05] (03PS1) 10Mstyles: Rename Test Kitchen Experiment [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271833 (https://phabricator.wikimedia.org/T420007) [17:42:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T410589)', diff saved to https://phabricator.wikimedia.org/P90833 and previous config saved to /var/cache/conftool/dbconfig/20260415-174212-ladsgroup.json [17:42:16] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [17:42:29] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1253.eqiad.wmnet with reason: Maintenance [17:42:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T410589)', diff saved to https://phabricator.wikimedia.org/P90834 and previous config saved to /var/cache/conftool/dbconfig/20260415-174236-ladsgroup.json [17:43:02] (03PS7) 10Andrew Bogott: Openstack designate: install zookeeper in codfw1dev servers [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (https://phabricator.wikimedia.org/T422646) [17:43:03] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply [17:43:22] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [17:43:53] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/device-analytics: apply [17:44:09] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [17:44:40] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [17:44:56] !log swfrench@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/edit-analytics: apply [17:45:24] (03CR) 10CI reject: [V:04-1] Openstack designate: install zookeeper in codfw1dev servers [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:45:28] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [17:45:42] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [17:46:13] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [17:46:27] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [17:46:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271832 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [17:46:58] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/media-analytics: apply [17:47:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271833 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [17:47:13] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [17:47:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [17:47:44] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/page-analytics: apply [17:47:52] (03CR) 10ScheduleDeploymentBot: "Scheduled 
for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271830 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [17:48:02] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [17:48:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T419961)', diff saved to https://phabricator.wikimedia.org/P90835 and previous config saved to /var/cache/conftool/dbconfig/20260415-174808-fceratto.json [17:49:12] (03PS8) 10Andrew Bogott: Openstack designate: install zookeeper in codfw1dev servers [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (https://phabricator.wikimedia.org/T422646) [17:49:45] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:53:43] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [17:53:57] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [17:54:28] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [17:54:43] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [17:55:14] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [17:55:28] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [17:55:59] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [17:56:12] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [17:56:44] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [17:56:51] (03PS1) 10Pppery: Enwikinews: disable lingering FlaggedRevs template 
processing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271839 (https://phabricator.wikimedia.org/T423512) [17:56:57] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [17:57:08] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [17:57:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11826448 (10VRiley-WMF) @elukey Would you be able to maybe look into these servers as well? I feel like I'm running into the same issue that Jenn was running onto as... [17:57:28] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [17:57:42] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [17:58:13] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [17:58:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P90836 and previous config saved to /var/cache/conftool/dbconfig/20260415-175817-fceratto.json [17:58:28] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [18:00:04] dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T1800). 
nyaa~ [18:00:13] brett@cumin2002 reimage (PID 139097) is awaiting input [18:00:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271839 (https://phabricator.wikimedia.org/T423512) (owner: 10Pppery) [18:00:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: 10Pppery) [18:01:27] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [18:01:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS trixie [18:04:19] (03PS1) 10Ryan Kemper: opensearch: remove dead plugins_dir parameter [puppet] - 10https://gerrit.wikimedia.org/r/1271843 (https://phabricator.wikimedia.org/T423327) [18:04:49] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11826479 (10Ladsgroup) Some progress report: In the past 24 hours, we had 9... 
[18:08:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P90837 and previous config saved to /var/cache/conftool/dbconfig/20260415-180825-fceratto.json [18:08:33] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271850 (https://phabricator.wikimedia.org/T420482) [18:08:36] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271850 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:08:42] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:47] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1115.* [18:09:34] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271850 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:09:58] (03PS11) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [18:11:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11826508 (10BCornwall) 05In progress→03Resolved I can confirm it's behaving properly now! Reimage worked just fine and I don't have any kernel errors any... 
[18:12:48] (03CR) 10Eevans: [C:03+2] aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:13:27] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271843 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [18:15:13] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.24 refs T420482 [18:15:17] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482 [18:18:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T419961)', diff saved to https://phabricator.wikimedia.org/P90839 and previous config saved to /var/cache/conftool/dbconfig/20260415-181833-fceratto.json [18:21:25] FIRING: [3x] ProbeDown: Service aqs1027-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:58] (03PS5) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [18:26:25] FIRING: [4x] ProbeDown: Service aqs1027-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:26:58] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1027.eqiad.wmnet with reason: Bootstrapping — T412830 [18:27:02] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [18:30:20] (03PS6) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 
(https://phabricator.wikimedia.org/T416948) [18:31:03] (03CR) 10CDanis: fundraising_data_import maintenance script wrapper & timer (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:31:25] FIRING: [4x] ProbeDown: Service aqs1027-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:31:27] (03CR) 10CDanis: fundraising_data_import maintenance script wrapper & timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:33:19] (03CR) 10Andrew Bogott: [C:03+2] Openstack designate: install zookeeper in codfw1dev servers [puppet] - 10https://gerrit.wikimedia.org/r/1271809 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:36:23] (03CR) 10Jdlrobson: "General migration plan documented here:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) (owner: 10Jdlrobson) [18:43:21] !log rolling back due to steady `Term with languageCode "en" not found` errors (cc T420482) [18:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:25] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482 [18:43:34] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271863 (https://phabricator.wikimedia.org/T420482) [18:43:37] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271863 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:44:52] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271863 (https://phabricator.wikimedia.org/T420482) 
(owner: 10TrainBranchBot) [18:49:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Passimacopoulos - https://phabricator.wikimedia.org/T423301#11826617 (10andrea.denisse) 05In progress→03Resolved >>! In T423301#11825657, @Passimacopoulos wrote: > Thanks Andrea, I can confirm I can... [18:49:57] (03PS3) 10Jdlrobson: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [18:50:30] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.24 refs T420482 [18:50:34] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482 [18:57:15] PROBLEM - ensure kvm processes are running on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:15] RECOVERY - ensure kvm processes are running on cloudvirt1054 is OK: PROCS OK: 7 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:54] (03CR) 10Ignacio Rodríguez: Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [19:06:24] (03CR) 10Jdlrobson: [C:03+1] siwikitionary: update logo to localised svg version. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [19:10:34] (03CR) 10Jdlrobson: Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [19:14:01] (03CR) 10Ignacio Rodríguez: Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [19:24:34] (03PS1) 10Cwhite: opensearch: add pki_intermediate_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1271879 (https://phabricator.wikimedia.org/T350516) [19:37:01] !log add GRE tunnel to cr1-drmrs towards cr2-eqiad [19:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:41] 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T418136#11826793 (10phaultfinder) [19:38:06] !log add GRE tunnel to cr2-eqiad towards cr1-drmrs [19:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:10] dduvall: Sorry about the breakages; potential fixes landing in master now, will then back-port. [19:41:32] 07sre-alert-triage, 06Quality-and-Test-Engineering-Team, 06Test Platform: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422582#11826800 (10A_smart_kitten) Boldly rerouting to #test_platform, judging by the `grafana_folder` label in the task description (& because I believe th... 
[19:42:14] 07sre-alert-triage, 06Quality-and-Test-Engineering-Team, 06Test Platform: Alert in need of triage: DatasourceNoData - https://phabricator.wikimedia.org/T422581#11826805 (10A_smart_kitten) Boldly rerouting to #test_platform, judging by the `grafana_folder` label in the task description (& because I believe th... [19:42:40] James_F: it happens. :) thanks for the fixes [19:45:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T419635)', diff saved to https://phabricator.wikimedia.org/P90840 and previous config saved to /var/cache/conftool/dbconfig/20260415-194548-fceratto.json [19:45:53] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [19:51:25] FIRING: [3x] ProbeDown: Service aqs1027-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270583 (https://phabricator.wikimedia.org/T413707) (owner: 10Brian Wolff) [19:55:14] !log add static routes on cr1-drmrs and cr2-eqiad for arelion GRE far-side IPv4 addresses [19:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:43] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:55:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P90841 and previous config saved to /var/cache/conftool/dbconfig/20260415-195556-fceratto.json [19:59:58] Oh screw you too, CI. [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T2000). [20:00:05] maryum, Pppery, and bawolff: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] here [20:00:13] \o/ [20:00:17] here!! [20:00:31] (03CR) 10Cwhite: [C:03+1] opensearch: remove dead plugins_dir parameter [puppet] - 10https://gerrit.wikimedia.org/r/1271843 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [20:00:41] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ipv4 dns names for eqiad-drmrs gre tunnel - cmooney@cumin1003" [20:01:20] (03PS1) 10Mstyles: Force Reauth [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271886 (https://phabricator.wikimedia.org/T419621) [20:01:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271886 (https://phabricator.wikimedia.org/T419621) (owner: 10Mstyles) [20:01:42] planning to use spiderpig [20:02:33] was anyone else planning to get started? [20:03:36] I'm ready to go with spiderpig [20:03:36] i'm going to squeeze into the very end of the window, but seems like you might as well get started [20:03:45] okay great going to start now [20:03:47] cmooney@cumin1003 netbox (PID 1257407) is awaiting input [20:03:55] cscott: I have train-blockers which trump. [20:04:04] oops, never mind me then! [20:04:09] Sorry. 
:-( [20:04:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271886 (https://phabricator.wikimedia.org/T419621) (owner: 10Mstyles) [20:04:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271830 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:04:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271833 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:04:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271832 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:05:00] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ipv4 dns names for eqiad-drmrs gre tunnel - cmooney@cumin1003" [20:05:00] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:05:18] I hope we end up billing the Test Kitchen team for the vast amount of work just to rename a piece of software. 
[20:05:24] (03Merged) 10jenkins-bot: Force Reauth [extensions/WikimediaCustomizations] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271886 (https://phabricator.wikimedia.org/T419621) (owner: 10Mstyles) [20:05:48] (03Merged) 10jenkins-bot: Rename Test Kitchen Experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271830 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:06:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P90842 and previous config saved to /var/cache/conftool/dbconfig/20260415-200605-fceratto.json [20:06:57] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11826902 (10Scott_French) Thanks, @Blake! Two thoughts: First, it might be a good idea to highlight that t... [20:07:31] Ugh it looks like CI flaked for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1271831 which maryum is trying to deploy [20:07:44] crap [20:08:10] Just manually C+2 it and re-trigger spiderpig, it'll Just Work™ [20:08:30] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:08:35] do I need to hit the stop button in spiderpig? [20:08:40] there's a handy "retry" button in spiderpig which will work too [20:08:45] Nah, it'll spot magically. [20:08:49] Oh, a retry button? Fancy. [20:09:03] (03CR) 10CI reject: [V:04-1] Rename Test Kitchen Experiment [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:09:10] but it might be waiting for the job to 'fully' fail first. 
you could help it along by aborting jobs maybe [20:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:09:21] yep the retry button worked [20:09:23] And of course the failure is the gerrit-429 error. [20:09:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:09:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271833 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:09:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271832 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:10:07] failed again [20:10:16] Yes, too much load on CI. [20:10:27] (03PS1) 10C. Scott Ananian: Exclude parser functions from SpecialLintTemplateErrors [extensions/Linter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271889 (https://phabricator.wikimedia.org/T420102) [20:10:30] Everyone is trying to merge lots of things and gerrit is just blocking CI access. [20:10:32] Helpful. [20:10:36] incredibly [20:12:23] (03PS2) 10Brian Wolff: Record file usage from TemplateStyles pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270583 (https://phabricator.wikimedia.org/T413707) [20:12:42] going to retry the retry [20:12:46] not sure what else to do [20:13:15] We can manually C+2 and wait 'til all of them land. [20:13:23] I'll do that [20:13:30] Also I can go just kill off other people's CI jobs. 
[20:13:33] (03CR) 10Mstyles: [C:03+2] Rename Test Kitchen Experiment [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:13:59] thank you James_F [20:15:51] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1271833 alone has been running for 11 minutes at this point. [20:15:56] (03CR) 10Eevans: [C:03+2] linked-artifacts: update staging to v1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271083 (https://phabricator.wikimedia.org/T414838) (owner: 10Eevans) [20:16:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T419635)', diff saved to https://phabricator.wikimedia.org/P90843 and previous config saved to /var/cache/conftool/dbconfig/20260415-201613-fceratto.json [20:16:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [20:16:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [20:16:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:17:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90844 and previous config saved to /var/cache/conftool/dbconfig/20260415-201700-fceratto.json [20:17:04] (03Merged) 10jenkins-bot: Rename Test Kitchen Experiment [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271833 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:17:08] (03Merged) 10jenkins-bot: Rename Test Kitchen Experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271832 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:17:14] Two down. 
[20:17:47] Just waiting for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1271831 at this point? [20:17:52] yep [20:17:54] (03Merged) 10jenkins-bot: linked-artifacts: update staging to v1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271083 (https://phabricator.wikimedia.org/T414838) (owner: 10Eevans) [20:17:56] taking forever [20:17:57] Fingers crossed. [20:18:14] Yeah, core patches (and any other repo with lots of dependencies) can take an age. [20:19:16] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:19:22] Made much worse by the new selenium tests in wmf/ added last week, after we banned them last year. [20:19:53] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [20:20:08] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [20:20:16] (03CR) 10C. Scott Ananian: "I still think the first option should be `2 => 180` instead of `0 => 180` so that the folks with 180px as their current preference (1,696," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) (owner: 10Jdlrobson) [20:21:49] James_F should I go ahead and deploy the other ones with spiderpig? [20:21:58] (03Merged) 10jenkins-bot: Rename Test Kitchen Experiment [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1271831 (https://phabricator.wikimedia.org/T420007) (owner: 10Mstyles) [20:22:05] There, all present. [20:22:07] Go for it. 
[20:22:16] just worked so going to go ahead now [20:22:22] (03PS1) 10Jforrester: PageRenderingHandler: Handle Wikibase's OutOfBoundsException for "we don't have a label" [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271894 (https://phabricator.wikimedia.org/T423514) [20:22:58] looks okay now [20:23:10] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1271886|Force Reauth (T419621)]], [[gerrit:1271830|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271831|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271833|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271832|Rename Test Kitchen Experiment (T420007)]] [20:23:15] T419621: Move site JS reauth code out into WikimediaCustomizations - https://phabricator.wikimedia.org/T419621 [20:23:16] T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007 [20:23:30] FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:25:04] !log mstyles@deploy1003 mstyles: Backport for [[gerrit:1271886|Force Reauth (T419621)]], [[gerrit:1271830|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271831|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271833|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271832|Rename Test Kitchen Experiment (T420007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now [20:25:04] be verified there. 
[20:26:36] !log enable ospf on GRE cr1-drmrs <-> cr2-eqiad [20:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:08] !log mstyles@deploy1003 mstyles: Continuing with sync [20:27:13] (03PS1) 10Jforrester: mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) [20:28:13] (03PS2) 10Jforrester: mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) [20:29:16] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:30:27] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [20:30:58] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271886|Force Reauth (T419621)]], [[gerrit:1271830|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271831|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271833|Rename Test Kitchen Experiment (T420007)]], [[gerrit:1271832|Rename Test Kitchen Experiment (T420007)]] (duration: 07m 48s) [20:31:03] T419621: Move site JS reauth code out into WikimediaCustomizations - https://phabricator.wikimedia.org/T419621 [20:31:04] T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007 [20:32:07] I'm done! [20:33:05] pppery is next. Who is deploying for them? [20:33:29] I am, apparently. Sigh. 
[20:33:32] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate v6 reverse records for 2a02:ec80:600:fe0a::1/64 - cmooney@cumin1003" [20:33:39] (03PS1) 10Cathal Mooney: Add INCLUDE statement for 2a02:ec80:600:fe0a::/64 snippet file [dns] - 10https://gerrit.wikimedia.org/r/1271899 [20:34:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: 10Pppery) [20:34:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271839 (https://phabricator.wikimedia.org/T423512) (owner: 10Pppery) [20:34:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270583 (https://phabricator.wikimedia.org/T413707) (owner: 10Brian Wolff) [20:34:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate v6 reverse records for 2a02:ec80:600:fe0a::1/64 - cmooney@cumin1003" [20:34:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:34:44] I don't think the Logos patch is testable since all it does is delete files, and Varnish caching will mean that those files are still reachable until the cache expires. The Wikinews patch is and I'm prepared to test it [20:35:10] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDE statement for 2a02:ec80:600:fe0a::/64 snippet file [dns] - 10https://gerrit.wikimedia.org/r/1271899 (owner: 10Cathal Mooney) [20:35:26] !log cmooney@dns2005 START - running authdns-update [20:35:32] Yeah, no worries on that one. The enwikinews patch is testable though? 
[20:35:38] Yes, I said so above [20:36:25] (03Merged) 10jenkins-bot: Drop 1.5x logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: 10Pppery) [20:36:32] (03PS1) 10Jforrester: PageRenderingHandler: Don't run repo-mode stuff in non-repo world [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271905 (https://phabricator.wikimedia.org/T423515) [20:36:33] (03Merged) 10jenkins-bot: Enwikinews: disable lingering FlaggedRevs template processing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271839 (https://phabricator.wikimedia.org/T423512) (owner: 10Pppery) [20:36:36] !log cmooney@dns2005 END - running authdns-update [20:36:37] (03Merged) 10jenkins-bot: Record file usage from TemplateStyles pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270583 (https://phabricator.wikimedia.org/T413707) (owner: 10Brian Wolff) [20:36:43] (03CR) 10Bking: [C:03+2] opensearch: remove dead plugins_dir parameter [puppet] - 10https://gerrit.wikimedia.org/r/1271843 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [20:36:59] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1268300|Drop 1.5x logos (T246054)]], [[gerrit:1271839|Enwikinews: disable lingering FlaggedRevs template processing (T423512)]], [[gerrit:1270583|Record file usage from TemplateStyles pages (T413707)]] [20:37:06] T246054: Consider dropping the '1.5x' size logos from srcsets - https://phabricator.wikimedia.org/T246054 [20:37:06] T423512: Main Page on en.WN not transcluding specific templates properly - https://phabricator.wikimedia.org/T423512 [20:37:07] T413707: Have TemplateStyles register an image link for any image referenced via url - https://phabricator.wikimedia.org/T413707 [20:37:53] bawolff: Are you around to test? Though not sure how testable it is. [20:37:58] yes, i can test [20:38:24] Awesome. Should be on mw-debug in a minute or so. 
[20:38:53] !log jforrester@deploy1003 jforrester, bawolff, pppery: Backport for [[gerrit:1268300|Drop 1.5x logos (T246054)]], [[gerrit:1271839|Enwikinews: disable lingering FlaggedRevs template processing (T423512)]], [[gerrit:1270583|Record file usage from TemplateStyles pages (T413707)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:38:56] looking [20:39:39] James_F: Can confirm it works as intended [20:39:44] Excellent. [20:42:06] James_F: are you planning on backporting those WikiLambda changes during this window? [20:42:15] My patch doesn't seem to be working, but it doesn't seem to be causing any harm either [20:42:19] dduvall: That was the plan. [20:42:22] Pppery: OK, let's ship. [20:42:23] !log enable BGP over GRE between cr1-drmrs and cr2-eqiad [20:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:27] !log jforrester@deploy1003 jforrester, bawolff, pppery: Continuing with sync [20:42:40] (03CR) 10Jforrester: [C:03+2] PageRenderingHandler: Handle Wikibase's OutOfBoundsException for "we don't have a label" [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271894 (https://phabricator.wikimedia.org/T423514) (owner: 10Jforrester) [20:42:41] James_F: ok. ty! i'll re-roll train today if there's time after the backport window [20:42:46] (03CR) 10Jforrester: [C:03+2] PageRenderingHandler: Don't run repo-mode stuff in non-repo world [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271905 (https://phabricator.wikimedia.org/T423515) (owner: 10Jforrester) [20:42:58] (03PS1) 10C. Scott Ananian: Pass preferred LanguageConverter variant explicitly instead of implicitly [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271910 (https://phabricator.wikimedia.org/T415435) [20:43:08] dduvall: After this window is my service window, so I'll be around (and deploying but not in MW land). 
[20:43:18] okie dokie [20:43:24] (03PS1) 10C. Scott Ananian: Make variant into a parser option for parsoid language conversion [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271911 (https://phabricator.wikimedia.org/T415435) [20:46:14] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268300|Drop 1.5x logos (T246054)]], [[gerrit:1271839|Enwikinews: disable lingering FlaggedRevs template processing (T423512)]], [[gerrit:1270583|Record file usage from TemplateStyles pages (T413707)]] (duration: 09m 15s) [20:46:23] T246054: Consider dropping the '1.5x' size logos from srcsets - https://phabricator.wikimedia.org/T246054 [20:46:24] T423512: Main Page on en.WN not transcluding specific templates properly - https://phabricator.wikimedia.org/T423512 [20:46:24] T413707: Have TemplateStyles register an image link for any image referenced via url - https://phabricator.wikimedia.org/T413707 [20:46:52] (03Merged) 10jenkins-bot: PageRenderingHandler: Handle Wikibase's OutOfBoundsException for "we don't have a label" [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271894 (https://phabricator.wikimedia.org/T423514) (owner: 10Jforrester) [20:47:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271894 (https://phabricator.wikimedia.org/T423514) (owner: 10Jforrester) [20:47:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271905 (https://phabricator.wikimedia.org/T423515) (owner: 10Jforrester) [20:48:31] (03CR) 10Ignacio Rodríguez: "this is how i mark things resolved?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [20:48:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Linter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271889 (https://phabricator.wikimedia.org/T420102) (owner: 10C. Scott Ananian) [20:48:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271910 (https://phabricator.wikimedia.org/T415435) (owner: 10C. Scott Ananian) [20:49:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271911 (https://phabricator.wikimedia.org/T415435) (owner: 10C. 
Scott Ananian) [20:50:37] (03CR) 10Jdlrobson: [C:03+1] Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [20:50:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.151) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:51:25] RESOLVED: ProbeDown: Service aqs1027-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1027-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:51:33] (03Merged) 10jenkins-bot: PageRenderingHandler: Don't run repo-mode stuff in non-repo world [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271905 (https://phabricator.wikimedia.org/T423515) (owner: 10Jforrester) [20:52:01] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1271894|PageRenderingHandler: Handle Wikibase's OutOfBoundsException for "we don't have a label" (T423514)]], [[gerrit:1271905|PageRenderingHandler: Don't run repo-mode stuff in non-repo world (T423515)]] [20:52:07] T423514: OutOfBoundsException: Term with languageCode "en" not found - https://phabricator.wikimedia.org/T423514 [20:52:07] T423515: MediaWiki\Extension\WikiLambda\ZObjectStore::findZLanguageFromCode: [1146] Table 'abstractwiki.wikilambda_zlanguages' doesn't exist - https://phabricator.wikimedia.org/T423515 [20:52:47] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-04-07-234729 to 2026-04-15-195941 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271725 (https://phabricator.wikimedia.org/T413729) [20:52:47] (03PS1) 10Jforrester: wikifunctions: Upgrade 
evaluators from 2026-04-06-224243 to 2026-04-14-215402 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271915 (https://phabricator.wikimedia.org/T402956) [20:52:49] (03CR) 10CI reject: [V:04-1] Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [20:53:57] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1271894|PageRenderingHandler: Handle Wikibase's OutOfBoundsException for "we don't have a label" (T423514)]], [[gerrit:1271905|PageRenderingHandler: Don't run repo-mode stuff in non-repo world (T423515)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:54:07] (03PS6) 10Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) [20:54:18] !log jforrester@deploy1003 jforrester: Continuing with sync [20:55:26] dduvall, James_F : I have a fix for Parsoid LanguageConverter support I'd like to backport, but I think all the language converter wikis are group1. It might be easier to test if I wait until the group1 train rolls. [20:55:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (gre) (185.15.58.151) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:56:13] (03CR) 10Jdlrobson: [C:04-2] "Note: Per plan in https://phabricator.wikimedia.org/T376152#11791562 nobody would lose their preference." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) (owner: 10Jdlrobson) [20:56:17] (03PS2) 10Cwhite: opensearch: add pki_intermediate_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1271879 (https://phabricator.wikimedia.org/T350516) [20:57:01] Ack. 
[20:57:13] My deploy will just end within the window, I think. [20:57:18] jouncebot: nowandnext [20:57:18] For the next 0 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T2000) [20:57:18] In 0 hour(s) and 2 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T2100) [20:57:22] there's another patch (https://gerrit.wikimedia.org/r/c/1271889/) which is low priority, it just needs to be deployed before the next parsoid version is released. It touches i18n, though, so it's going to be a slow deploy. Advice welcome on best time to deploy that. [20:57:32] Not now. :-) [20:57:42] After the train deploy and we find everything is good? [20:57:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 108797704 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:58:07] i think if the train is good i'd want to try to back port the higher priority language converter patches. [20:58:09] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271894|PageRenderingHandler: Handle Wikibase's OutOfBoundsException for "we don't have a label" (T423514)]], [[gerrit:1271905|PageRenderingHandler: Don't run repo-mode stuff in non-repo world (T423515)]] (duration: 06m 08s) [20:58:15] T423514: OutOfBoundsException: Term with languageCode "en" not found - https://phabricator.wikimedia.org/T423514 [20:58:15] T423515: MediaWiki\Extension\WikiLambda\ZObjectStore::findZLanguageFromCode: [1146] Table 'abstractwiki.wikilambda_zlanguages' doesn't exist - https://phabricator.wikimedia.org/T423515 [20:58:30] dduvall: Over to you for the train? 
[20:58:43] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 45320 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:59:23] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271915 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T2100) [21:00:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudelastic1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:00:58] FIRING: PuppetFailure: Puppet has failed on cloudelastic1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:01:46] FYI, I'll be applying some pending external-services network policy changes in the background. should be non-disruptive. [21:02:04] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [21:02:10] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-06-224243 to 2026-04-14-215402 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271915 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester) [21:02:30] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [21:03:08] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[21:03:26] FIRING: [14x] SystemdUnitFailed: opensearch-disable-readahead-cloudelastic-chi-eqiad.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:30] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [21:04:00] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [21:04:12] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:04:26] (03CR) 10Cwhite: [C:03+2] "PCC NOOP in production" [puppet] - 10https://gerrit.wikimedia.org/r/1271879 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [21:04:29] James_F: yep yep [21:04:49] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [21:04:59] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:05:02] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:05:04] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271921 (https://phabricator.wikimedia.org/T420482) [21:05:06] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271921 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [21:05:08] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:05:31] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:05:33] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [21:05:59] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[21:06:15] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:06:18] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cloudelastic1012.eqiad.wmnet with reason: still fixing Puppet [21:06:21] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:06:26] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [21:06:43] (03PS1) 10Cathal Mooney: Add OSPF config for cr1-drmrs <-> cr2-eqiad GRE [homer/public] - 10https://gerrit.wikimedia.org/r/1271923 [21:07:05] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:07:29] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271921 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [21:07:30] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [21:09:16] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:09:17] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-04-07-234729 to 2026-04-15-195941 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271725 (https://phabricator.wikimedia.org/T413729) (owner: 10Jforrester) [21:09:56] (03CR) 10Cathal Mooney: [C:03+2] Add OSPF config for cr1-drmrs <-> cr2-eqiad GRE [homer/public] - 10https://gerrit.wikimedia.org/r/1271923 (owner: 10Cathal Mooney) [21:11:22] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-04-07-234729 to 2026-04-15-195941 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271725 
(https://phabricator.wikimedia.org/T413729) (owner: 10Jforrester) [21:11:33] (03Merged) 10jenkins-bot: Add OSPF config for cr1-drmrs <-> cr2-eqiad GRE [homer/public] - 10https://gerrit.wikimedia.org/r/1271923 (owner: 10Cathal Mooney) [21:12:25] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:12:45] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:12:50] (03PS1) 10CDanis: envoyproxy::tls_terminator: request header rewriting [puppet] - 10https://gerrit.wikimedia.org/r/1271926 [21:12:50] (03PS1) 10CDanis: swift::proxy: attempt some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1271927 [21:13:00] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:13:15] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.24 refs T420482 [21:13:19] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482 [21:13:28] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [21:13:33] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:13:41] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:14:18] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:16:35] (03CR) 10CDanis: [V:03+1] "pcc: https://puppet-compiler.wmflabs.org/output/1271927/6413/ms-fe2018.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [21:16:56] James_F: i'm seeing https://phabricator.wikimedia.org/T423515 again [21:17:04] Bother. [21:17:12] Can we get an actual trace? 
[21:17:59] (03PS2) 10CDanis: swift::proxy: attempt some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1271927 [21:18:37] i haven't seen one that isn't surfaced as a deprecation warning [21:19:24] Got one. [21:19:26] One sec. [21:20:24] Aha, will write a quick patch. [21:21:17] (03PS1) 10Bking: cloudelastic: temporarily add "working typos" for plugins [puppet] - 10https://gerrit.wikimedia.org/r/1271929 (https://phabricator.wikimedia.org/T423327) [21:21:22] James_F: i think i found one as well. updated the description of https://phabricator.wikimedia.org/T423515 [21:21:46] i'm not going to rollback for this one. we can just patch forward [21:21:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271929 (https://phabricator.wikimedia.org/T423327) (owner: 10Bking) [21:21:49] Ack. [21:26:29] (03PS2) 10Bking: cloudelastic: temporarily add "working typos" for plugins [puppet] - 10https://gerrit.wikimedia.org/r/1271929 (https://phabricator.wikimedia.org/T423327) [21:27:08] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: temporarily add "working typos" for plugins [puppet] - 10https://gerrit.wikimedia.org/r/1271929 (https://phabricator.wikimedia.org/T423327) (owner: 10Bking) [21:27:20] (03CR) 10Bking: [C:03+2] cloudelastic: temporarily add "working typos" for plugins [puppet] - 10https://gerrit.wikimedia.org/r/1271929 (https://phabricator.wikimedia.org/T423327) (owner: 10Bking) [21:29:15] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1027.eqiad.wmnet [21:29:16] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1027.eqiad.wmnet [21:29:35] With the changed line items it's possible my first fix did indeed fix the first instance but not the second. Either way, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1271941 should fix that one. 
[21:33:17] (03PS1) 10Jforrester: wikifunctions: Double the number of evaluators from 2 to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271942 (https://phabricator.wikimedia.org/T419933) [21:56:33] (03PS1) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [21:57:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260415T2200) [22:00:44] (03PS1) 10Jforrester: PageRenderingHandler: Don't run repo-mode lang check in non-repo world either [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271945 (https://phabricator.wikimedia.org/T423515) [22:01:29] dduvall: OK, should I deploy my second fix? [22:01:46] sure! [22:01:49] and ty [22:01:55] (03CR) 10Jforrester: [C:03+2] PageRenderingHandler: Don't run repo-mode lang check in non-repo world either [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271945 (https://phabricator.wikimedia.org/T423515) (owner: 10Jforrester) [22:02:17] And then over to cscott for his back-ports? 
[22:03:32] (03PS1) 10Ryan Kemper: opensearch: allowlist upstream-only plugins [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) [22:04:16] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:04:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [22:04:53] (03PS1) 10Dduvall: zuul: Configure environment variables for http(s) proxy [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) [22:05:47] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [22:05:59] (03Merged) 10jenkins-bot: PageRenderingHandler: Don't run repo-mode lang check in non-repo world either [extensions/WikiLambda] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271945 (https://phabricator.wikimedia.org/T423515) (owner: 10Jforrester) [22:06:51] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1271945|PageRenderingHandler: Don't run repo-mode lang check in non-repo world either (T423515)]] [22:06:52] (03PS2) 10Dduvall: zuul: Configure environment variables for http(s) proxy [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) [22:06:55] T423515: MediaWiki\Extension\WikiLambda\ZObjectStore::findZLanguageFromCode: [1146] Table 'abstractwiki.wikilambda_zlanguages' doesn't exist - https://phabricator.wikimedia.org/T423515 [22:08:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards 
[22:08:44] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1271945|PageRenderingHandler: Don't run repo-mode lang check in non-repo world either (T423515)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:11:48] !log jforrester@deploy1003 jforrester: Continuing with sync [22:12:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90845 and previous config saved to /var/cache/conftool/dbconfig/20260415-221216-fceratto.json [22:12:21] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [22:13:17] dduvall: Hopefully fixed. [22:13:35] (03PS2) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:14:04] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:14:43] (03PS3) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:15:12] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:15:26] (03PS4) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:15:39] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:15:39] !log 
jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271945|PageRenderingHandler: Don't run repo-mode lang check in non-repo world either (T423515)]] (duration: 08m 48s) [22:15:43] T423515: MediaWiki\Extension\WikiLambda\ZObjectStore::findZLanguageFromCode: [1146] Table 'abstractwiki.wikilambda_zlanguages' doesn't exist - https://phabricator.wikimedia.org/T423515 [22:15:55] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:16:30] (03PS5) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:17:00] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:17:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:19:48] (03CR) 10Cwhite: "Something to consider:" [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [22:20:49] Sorry I had to step away for a few minutes, but if the window is free I'm happy to use it [22:21:19] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271950 [22:21:36] James_F: or are you still in progress? [22:21:39] Go for it. 
[22:22:07] (03PS6) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:22:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:22:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P90846 and previous config saved to /var/cache/conftool/dbconfig/20260415-222225-fceratto.json [22:22:37] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:25:08] 06SRE, 10Infrastructure Security: Consider "inner" and "outer" ssh keys to reduce taps through the day - https://phabricator.wikimedia.org/T422068#11827541 (10colewhite) I wonder if ssh certificates could be of use? I imagine a "tap to access bastion, tap to sign short-lived certificate" then on subsequent lo... 
[22:26:53] (03PS7) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:27:23] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:28:41] (03PS8) 10Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) [22:28:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:29:11] (03CR) 10CI reject: [V:04-1] designate: fix up zookeeper host lists to use private network [puppet] - 10https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [22:29:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271910 (https://phabricator.wikimedia.org/T415435) (owner: 10C. Scott Ananian) [22:29:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271911 (https://phabricator.wikimedia.org/T415435) (owner: 10C. Scott Ananian) [22:32:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P90847 and previous config saved to /var/cache/conftool/dbconfig/20260415-223233-fceratto.json [22:40:49] (03Merged) 10jenkins-bot: Pass preferred LanguageConverter variant explicitly instead of implicitly [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271910 (https://phabricator.wikimedia.org/T415435) (owner: 10C. 
Scott Ananian) [22:40:56] (03Merged) 10jenkins-bot: Make variant into a parser option for parsoid language conversion [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271911 (https://phabricator.wikimedia.org/T415435) (owner: 10C. Scott Ananian) [22:41:21] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1271910|Pass preferred LanguageConverter variant explicitly instead of implicitly (T415435)]], [[gerrit:1271911|Make variant into a parser option for parsoid language conversion (T415435)]] [22:41:24] T415435: Add temporary URL request parameter to opt-in to the new Parsoid LanguageConverter implementation - https://phabricator.wikimedia.org/T415435 [22:42:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90848 and previous config saved to /var/cache/conftool/dbconfig/20260415-224241-fceratto.json [22:42:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [22:42:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1238.eqiad.wmnet with reason: Maintenance [22:43:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T419635)', diff saved to https://phabricator.wikimedia.org/P90849 and previous config saved to /var/cache/conftool/dbconfig/20260415-224305-fceratto.json [22:43:11] !log cscott@deploy1003 cscott: Backport for [[gerrit:1271910|Pass preferred LanguageConverter variant explicitly instead of implicitly (T415435)]], [[gerrit:1271911|Make variant into a parser option for parsoid language conversion (T415435)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:53:34] !log cscott@deploy1003 cscott: Continuing with sync [22:57:21] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271910|Pass preferred LanguageConverter variant explicitly instead of implicitly (T415435)]], [[gerrit:1271911|Make variant into a parser option for parsoid language conversion (T415435)]] (duration: 16m 00s) [22:57:25] T415435: Add temporary URL request parameter to opt-in to the new Parsoid LanguageConverter implementation - https://phabricator.wikimedia.org/T415435 [22:58:20] ok, well that deploy is done. if no one else is waiting for anything, I can do the low-priority "long" one (i18n) [22:59:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Linter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271889 (https://phabricator.wikimedia.org/T420102) (owner: 10C. Scott Ananian) [22:59:17] (03PS1) 10Eevans: linked-artifacts: deploy v1.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271958 (https://phabricator.wikimedia.org/T414838) [23:01:29] (03CR) 10Eevans: [C:03+2] linked-artifacts: deploy v1.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271958 (https://phabricator.wikimedia.org/T414838) (owner: 10Eevans) [23:02:10] (03CR) 10RLazarus: [C:03+1] envoyproxy::tls_terminator: request header rewriting [puppet] - 10https://gerrit.wikimedia.org/r/1271926 (owner: 10CDanis) [23:02:34] (03Merged) 10jenkins-bot: Exclude parser functions from SpecialLintTemplateErrors [extensions/Linter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1271889 (https://phabricator.wikimedia.org/T420102) (owner: 10C. 
Scott Ananian) [23:02:59] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1271889|Exclude parser functions from SpecialLintTemplateErrors (T420102)]] [23:03:03] T420102: Special:LintTemplateErrors shows parser functions without context - https://phabricator.wikimedia.org/T420102 [23:03:26] (03Merged) 10jenkins-bot: linked-artifacts: deploy v1.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271958 (https://phabricator.wikimedia.org/T414838) (owner: 10Eevans) [23:05:13] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [23:05:32] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [23:12:42] (03PS1) 10Bartosz Dziewoński: Move privileged global and local group handling to WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) [23:13:34] (03CR) 10CI reject: [V:04-1] Move privileged global and local group handling to WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: 10Bartosz Dziewoński) [23:20:01] !log cscott@deploy1003 cscott: Backport for [[gerrit:1271889|Exclude parser functions from SpecialLintTemplateErrors (T420102)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:20:05] T420102: Special:LintTemplateErrors shows parser functions without context - https://phabricator.wikimedia.org/T420102 [23:23:00] !log cscott@deploy1003 cscott: Continuing with sync [23:28:07] (03PS2) 10Bartosz Dziewoński: Move privileged global and local group handling to WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) [23:35:47] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271889|Exclude parser functions from SpecialLintTemplateErrors (T420102)]] (duration: 32m 47s) [23:35:52] T420102: Special:LintTemplateErrors shows parser functions without context - https://phabricator.wikimedia.org/T420102 [23:36:06] ok, done! [23:36:22] 34min for an i18n deploy, not *too* bad. [23:39:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1271981 [23:39:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1271981 (owner: 10TrainBranchBot) [23:53:13] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1271981 (owner: 10TrainBranchBot) [23:54:40] (03PS1) 10Eevans: installserver: configure new aqs hosts for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1271985 (https://phabricator.wikimedia.org/T412830) [23:55:45] (03CR) 10Scott French: "Thanks, Chris!" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)