[00:07:41] PROBLEM - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s7 at eqiad (db1171) taken on 2026-04-12 23:09:24 is 742 GiB, but the previous one was 941 GiB, a change of -21.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:19:35] 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065 (10Pppery) 03NEW [00:34:55] 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11812480 (10Pppery) [00:53:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:15:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:17:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:17:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:34:41] !log on gerrit2003 restarted gerrit T423027 [03:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:44] T423027: 2026-04-12 Gerrit Outage (was: DiskSpace) - https://phabricator.wikimedia.org/T423027 [03:43:05] (03CR) 10ArielGlenn: [C:03+1] rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [03:54:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:54:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:55:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:55:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:06:13] (03CR) 10ArielGlenn: "Actually I'd like more clarity :-D Where are the expensive api queries executed, one or the other of those domains or both?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [04:45:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:45:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:46:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:46:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:53:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:46] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11812621 (10Marostegui) Thank you - let me know if I can help [05:11:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 67141960 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:12:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3182952 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:20:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T410589)', diff saved to 
https://phabricator.wikimedia.org/P90465 and previous config saved to /var/cache/conftool/dbconfig/20260413-052042-ladsgroup.json [05:20:47] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:30:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P90466 and previous config saved to /var/cache/conftool/dbconfig/20260413-053050-ladsgroup.json [05:41:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P90467 and previous config saved to /var/cache/conftool/dbconfig/20260413-054100-ladsgroup.json [05:51:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T410589)', diff saved to https://phabricator.wikimedia.org/P90468 and previous config saved to /var/cache/conftool/dbconfig/20260413-055106-ladsgroup.json [05:51:10] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:51:23] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [05:51:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T410589)', diff saved to https://phabricator.wikimedia.org/P90469 and previous config saved to /var/cache/conftool/dbconfig/20260413-055130-ladsgroup.json [06:00:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:44:35] (03CR) 10Muehlenhoff: "Looks good. Alternatively we could also simply stick with the 3.0 package as maintained by Debian? 
For the container images we need to les" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [06:52:03] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1269466 (owner: 10Muehlenhoff) [07:00:04] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T0700) [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:05:32] (03PS1) 10Muehlenhoff: Add a new Cumin alias to match hosts which are accessible via kerberized SSH [puppet] - 10https://gerrit.wikimedia.org/r/1270279 [07:09:27] !log installing openssh security updates [07:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:52] (03CR) 10Elukey: "I am totally fine with it, my main question mark is who should help maintaining this. I/F can surely help but a team like Traffic is in a " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [07:12:05] (03PS1) 10Majavah: P:kubernetes: deployment_server: Remove kafka cluster IPv6 flag [puppet] - 10https://gerrit.wikimedia.org/r/1270281 [07:13:06] (03CR) 10Muehlenhoff: "I don't have a real prefence either, just mentioning the option :-)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [07:15:15] (03PS1) 10Majavah: P:wmcs::striker: Remove separate monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/1270282 [07:16:06] (03PS4) 10Majavah: hieradata: Enable paging for dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268979 [07:16:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mirror1001.wikimedia.org [07:17:08] (03CR) 10Majavah: [C:03+2] hieradata: Enable paging 
for dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268979 (owner: 10Majavah) [07:20:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [07:20:53] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [07:21:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [07:21:38] (03PS2) 10Majavah: wikimedia.org: Send dumps to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) [07:21:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [07:21:44] Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ... 
[07:21:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [07:23:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mirror1001.wikimedia.org [07:25:26] (03CR) 10Filippo Giunchedi: [C:03+1] wikimedia.org: Send dumps to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [07:34:18] (03PS1) 10Brouberol: growthbook-next: test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270284 (https://phabricator.wikimedia.org/T420781) [07:35:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:35:35] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [07:38:18] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8406/co" [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah) [07:40:27] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[07:55:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:01:28] (03PS1) 10Elukey: profile::pki::intermediates: update debmonitor's public key [puppet] - 10https://gerrit.wikimedia.org/r/1270286 (https://phabricator.wikimedia.org/T420993) [08:06:15] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [08:09:17] (03CR) 10Majavah: [C:03+2] wikimedia.org: Send dumps to LVS service [dns] - 10https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [08:09:24] !log taavi@dns1004 START - running authdns-update [08:10:46] !log taavi@dns1004 END - running authdns-update [08:16:34] (03PS1) 10Majavah: wikimedia.org: Restore original TTL for dumps [dns] - 10https://gerrit.wikimedia.org/r/1270363 (https://phabricator.wikimedia.org/T422040) [08:17:06] (03CR) 10Elukey: [C:03+2] cfssl::cert: handle the rotation of the intermediate keypair [puppet] - 10https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:19:31] (03PS1) 10Kevin Bazira: istio-proxy: add EnvoyFilter to rewrite KServe batcher error responses for edit-check isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) [08:21:45] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [08:22:34] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P90470 and previous config saved to 
/var/cache/conftool/dbconfig/20260413-082233-fceratto.json [08:22:39] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:25:49] (03CR) 10MVernon: "Thanks for tagging me on this, but swift no longer uses nginx (and I double-checked on debmonitor.wikimedia.org that I'd not missed any)" [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [08:29:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [08:29:46] (03CR) 10Muehlenhoff: [C:03+2] Obsolete airflow-wmde-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1266959 (owner: 10Muehlenhoff) [08:30:20] (03CR) 10Btullis: [C:03+1] "Great, thanks for this." [puppet] - 10https://gerrit.wikimedia.org/r/1269227 (https://phabricator.wikimedia.org/T422778) (owner: 10Marostegui) [08:30:26] (03PS1) 10Tiziano Fogli: alerts/deploy: reload config on correct instance during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1270367 (https://phabricator.wikimedia.org/T406054) [08:38:03] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P90471 and previous config saved to /var/cache/conftool/dbconfig/20260413-083801-fceratto.json [08:38:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:39:43] (03CR) 10Dpogorzelski: [C:03+1] "I understand the problem, we can try it on experimental." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [08:41:38] (03CR) 10Dpogorzelski: [C:03+1] "https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol#inference-response-json-error-object is the more up to " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [08:43:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org [08:48:51] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P90473 and previous config saved to /var/cache/conftool/dbconfig/20260413-084850-fceratto.json [08:48:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org [08:50:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270368 [08:50:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270368 (owner: 10TrainBranchBot) [08:51:40] (03CR) 10Btullis: [C:03+1] growthbook-next: test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270284 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol) [08:52:05] (03CR) 10Brouberol: [C:03+2] growthbook-next: test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270284 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol) [08:53:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:40] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance 
db2150', diff saved to https://phabricator.wikimedia.org/P90474 and previous config saved to /var/cache/conftool/dbconfig/20260413-085938-fceratto.json [09:00:07] (03CR) 10Daniel Kinzler: "On both domains. But wikifunctions isn't routed through the gateway at the moment. It's even running on a separate cluster. It's possible " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [09:01:12] (03CR) 10Clément Goubert: [C:03+1] "Currently, queries to `abstract.wikipedia.org` are executed on the same `mw-api-ext` deployments as other wikis." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [09:03:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270368 (owner: 10TrainBranchBot) [09:05:06] (03PS4) 10Daniel Kinzler: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) [09:05:13] (03PS3) 10Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 [09:05:19] (03PS4) 10Daniel Kinzler: rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) [09:07:35] (03PS1) 10Federico Ceratto: sre.mysql.pool: Handle private tasks exception [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) [09:07:35] (03CR) 10Federico Ceratto: "Do we have a task where we can test this before merging perhaps?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [09:08:12] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: adjust expressions for multi-instance compactor [alerts] - 10https://gerrit.wikimedia.org/r/1269673 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [09:08:13] (03CR) 10Kevin Bazira: [C:03+2] "thanks for sharing links to KServe v2 protocol docs. unfortunately, the kserve batcher seems to only support the KServe v1 protocol: https" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [09:09:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno) [09:09:59] (03Merged) 10jenkins-bot: thanos/compact: adjust expressions for multi-instance compactor [alerts] - 10https://gerrit.wikimedia.org/r/1269673 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [09:10:28] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P90476 and previous config saved to /var/cache/conftool/dbconfig/20260413-091027-fceratto.json [09:10:32] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:10:35] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [09:11:24] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90477 and previous config saved to /var/cache/conftool/dbconfig/20260413-091122-fceratto.json [09:15:39] !log root@cumin1003 START - 
Cookbook sre.mysql.depool depool pc1011: Security updates [09:15:39] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [09:15:48] !log root@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [09:15:48] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates [09:16:31] (03Merged) 10jenkins-bot: istio-proxy: add EnvoyFilter to rewrite KServe batcher error responses for edit-check isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [09:17:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates [09:17:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [09:17:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:17:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates [09:19:06] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates [09:19:06] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [09:19:11] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:19:11] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates [09:19:37] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[09:25:13] PROBLEM - MariaDB Replica IO: pc1 on pc2011 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1011.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1011.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:26:41] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90479 and previous config saved to /var/cache/conftool/dbconfig/20260413-092640-fceratto.json [09:26:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:29:00] (03PS4) 10Clément Goubert: haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [09:29:13] RECOVERY - MariaDB Replica IO: pc1 on pc2011 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:32:35] (03CR) 10Fabfur: [C:03+1] haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [09:37:30] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P90480 and previous config saved to /var/cache/conftool/dbconfig/20260413-093729-fceratto.json [09:38:25] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [09:42:58] (03CR) 10Blake: [C:03+2] haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [09:43:13] 
(03CR) 10Blake: [V:03+2 C:03+2] haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: 10Elukey) [09:47:45] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:48:19] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P90481 and previous config saved to /var/cache/conftool/dbconfig/20260413-094818-fceratto.json [09:49:28] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:55:06] (03PS1) 10Blake: thumbor: upgrade haproxy to 3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) [09:57:23] (03CR) 10Hnowlan: [C:03+1] alerts/deploy: reload config on correct instance during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1270367 (https://phabricator.wikimedia.org/T406054) (owner: 10Tiziano Fogli) [09:57:51] (03CR) 10Tiziano Fogli: [C:03+2] alerts/deploy: reload config on correct instance during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1270367 (https://phabricator.wikimedia.org/T406054) (owner: 10Tiziano Fogli) [09:59:08] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90482 and previous config saved to /var/cache/conftool/dbconfig/20260413-095906-fceratto.json [09:59:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:59:15] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [10:00:04] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90483 and previous config saved to 
/var/cache/conftool/dbconfig/20260413-100003-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1000) [10:00:21] (03CR) 10Hnowlan: [C:03+1] "I don't have capacity to merge/deploy this atm, but at base this looks good - thanks!" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1270150 (https://phabricator.wikimedia.org/T290345) (owner: 10TheDJ) [10:00:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:25] (03CR) 10Clément Goubert: [C:03+1] thumbor: upgrade haproxy to 3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) (owner: 10Blake) [10:01:41] (03CR) 10Blake: [C:03+2] thumbor: upgrade haproxy to 3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) (owner: 10Blake) [10:02:40] (03CR) 10Daniel Kinzler: rest gateway: prevent abuse of exempt api modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [10:03:54] (03Merged) 10jenkins-bot: thumbor: upgrade haproxy to 3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) (owner: 10Blake) [10:04:12] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:05:43] !log blake@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [10:05:54] !log blake@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:06:21] (03CR) 
10MVernon: [C:03+2] apus: add two new storage nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:06:39] !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:06:44] (03Merged) 10jenkins-bot: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:07:26] !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [10:09:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1269970 (owner: 10Federico Ceratto) [10:09:20] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:09:24] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:11:49] (03CR) 10Hnowlan: [C:03+2] prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [10:14:14] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:14:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet [10:14:52] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:15:11] !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:15:32] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90484 and previous config saved to /var/cache/conftool/dbconfig/20260413-101530-fceratto.json [10:15:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:15:45] !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: 
apply [10:16:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11813595 (10MatthewVernon) 05Resolved→03Open Hi @Jclark-ctr could you take another look at the disks on these two systems, please? There should be 24... [10:19:20] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:19:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet [10:19:42] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:19:51] (03PS4) 10Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 [10:20:05] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler) [10:20:18] (03PS5) 10Daniel Kinzler: rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) [10:20:23] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [10:22:34] (03Merged) 10jenkins-bot: rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler) [10:22:36] (03Merged) 10jenkins-bot: rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [10:23:06] (03PS1) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) 
[10:26:20] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P90485 and previous config saved to /var/cache/conftool/dbconfig/20260413-102619-fceratto.json [10:26:40] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[3067,3074].esams.wmnet} and A:cp - 9.2.13 upgrade (T422328) [10:28:32] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:29:00] ACKNOWLEDGEMENT - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s7 at eqiad (db1171) taken on 2026-04-12 23:09:24 is 742 GiB, but the previous one was 941 GiB, a change of -21.2 % Jcrespo expected - The acknowledgement expires at: 2026-04-15 10:28:40. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:29:19] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:29:28] (03CR) 10Elukey: "@rcoccioli@wikimedia.org: no-shame-time: I used an AI assistant to navigate the parallel failures since they were really sneaky, but the r" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [10:29:52] (03PS1) 10MVernon: codfw: remove 3 drained ms be nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1270382 (https://phabricator.wikimedia.org/T354872) [10:30:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1270286 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:33:18] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:34:00] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:37:09] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P90486 and previous config saved to 
/var/cache/conftool/dbconfig/20260413-103707-fceratto.json [10:37:40] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[3067,3074].esams.wmnet} and A:cp - 9.2.13 upgrade (T422328) [10:38:15] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:38:42] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:40:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01) [10:41:06] (03PS1) 10Michael Große: stats: add counters for experiment account creation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) [10:41:13] (03PS1) 10Michael Große: Record TOR account creation failure separately [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) [10:41:37] (03CR) 10D3r1ck01: "Scheduled for next week on Monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01) [10:43:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [10:43:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [10:43:34] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11813664 (10MLechvien-WMF) @Scott_French @Blake can we update the description with the conclusion on what n... [10:44:50] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [10:46:24] (03CR) 10Volans: "Thanks for digging rabbit hole, comments inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [10:47:57] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90487 and previous config saved to /var/cache/conftool/dbconfig/20260413-104756-fceratto.json [10:48:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:48:05] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [10:48:54] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90488 and previous config saved to /var/cache/conftool/dbconfig/20260413-104852-fceratto.json [10:50:37] (03CR) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [10:51:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [10:55:56] 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1220:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T423009#11813719 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced optic and cable [10:57:50] (03PS1) 10Jforrester: [abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 [11:02:09] PROBLEM - Host cirrussearch1103 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:03] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1103:9290 - https://phabricator.wikimedia.org/T422832#11813771 (10Jclark-ctr) 05Open→03Resolved Both cables were present and inserted, with green lights. Reseated the PSU. cleared errors [11:04:06] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90489 and previous config saved to /var/cache/conftool/dbconfig/20260413-110405-fceratto.json [11:04:09] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:04:54] (03PS1) 10Muehlenhoff: Temporarily depool puppetserver1003/2004 [dns] - 10https://gerrit.wikimedia.org/r/1270408 [11:05:01] RECOVERY - Host cirrussearch1103 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [11:14:54] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P90490 and previous config saved to /var/cache/conftool/dbconfig/20260413-111452-fceratto.json [11:21:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [11:21:59] Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ... 
[11:21:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:25:43] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P90491 and previous config saved to /var/cache/conftool/dbconfig/20260413-112541-fceratto.json [11:36:32] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90492 and previous config saved to /var/cache/conftool/dbconfig/20260413-113630-fceratto.json [11:36:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:36:39] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance [11:36:43] (03CR) 10Muehlenhoff: [C:03+2] Temporarily depool puppetserver1003/2004 [dns] - 10https://gerrit.wikimedia.org/r/1270408 (owner: 10Muehlenhoff) [11:36:48] !log jmm@dns1004 START - running authdns-update [11:38:07] !log jmm@dns1004 END - running authdns-update [11:38:44] (03CR) 10Marostegui: [C:03+2] installservers: Do not format /srv on an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1269227 (https://phabricator.wikimedia.org/T422778) (owner: 10Marostegui) [11:48:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2004.codfw.wmnet [11:49:06] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [11:49:55] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P90493 and previous config saved to /var/cache/conftool/dbconfig/20260413-114953-fceratto.json [11:49:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:54:32] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2004.codfw.wmnet [11:55:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1003.eqiad.wmnet [11:58:15] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Allow to easily disable puppet-merges temporarily - https://phabricator.wikimedia.org/T423121 (10MoritzMuehlenhoff) 03NEW [12:01:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1003.eqiad.wmnet [12:04:29] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P90494 and previous config saved to /var/cache/conftool/dbconfig/20260413-120428-fceratto.json [12:04:33] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:05:11] (03PS1) 10Muehlenhoff: Revert "Temporarily depool puppetserver1003/2004" [dns] - 10https://gerrit.wikimedia.org/r/1270413 [12:12:07] (03CR) 10Ladsgroup: [C:03+1] codfw: remove 3 drained ms be nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1270382 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [12:15:18] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P90495 and previous config saved to /var/cache/conftool/dbconfig/20260413-121516-fceratto.json [12:19:01] (03PS2) 10Jforrester: [DNM] Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) [12:20:19] (03CR) 10Muehlenhoff: [C:03+2] Revert "Temporarily depool puppetserver1003/2004" [dns] - 10https://gerrit.wikimedia.org/r/1270413 (owner: 10Muehlenhoff) [12:20:23] !log jmm@dns1004 START - running authdns-update [12:21:43] !log jmm@dns1004 END - running authdns-update [12:23:17] (03CR) 10MVernon: [C:03+2] codfw: remove 3 drained 
ms be nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1270382 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [12:26:06] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P90496 and previous config saved to /var/cache/conftool/dbconfig/20260413-122604-fceratto.json [12:26:55] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host clouddb1019.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:29:27] (03PS1) 10Muehlenhoff: mariadb: Migrate section-specific DBA access rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) [12:31:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [12:36:29] (03PS3) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) [12:36:44] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:36:48] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:36:54] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P90497 and previous config saved to /var/cache/conftool/dbconfig/20260413-123653-fceratto.json [12:36:57] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:37:10] (03PS3) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [12:37:13] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [12:37:24] (03PS1) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [12:38:02] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P90498 and previous config saved to /var/cache/conftool/dbconfig/20260413-123801-fceratto.json [12:38:41] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie [12:38:44] (03CR) 10CI reject: [V:04-1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [12:38:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814093 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie [12:40:01] (03PS2) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [12:40:55] 
(03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 (owner: 10Jforrester) [12:45:49] (03PS4) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [12:45:49] (03PS3) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [12:47:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1019.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:48:25] (03CR) 10Kamila Součková: "Looking at the pcc diff, the IP addresses changed. I haven't looked into why, but I thought this was a no-functional-change-intended patch" [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah) [12:49:14] (03CR) 10Majavah: [V:03+1] "As far as I can tell that is just a change in how they are ordered?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah) [12:52:32] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P90499 and previous config saved to /var/cache/conftool/dbconfig/20260413-125231-fceratto.json [12:52:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:52:36] jclark@cumin1003 reimage (PID 2350535) is awaiting input [12:52:50] jclark@cumin1003 reimage (PID 2361809) is awaiting input [12:53:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:00] (03PS2) 10Anzx: urwikisource: add مصنف (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) [12:54:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:54:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:54:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) 
(owner: 10Anzx) [12:55:22] urwikisource 🤔 [12:55:22] https://bash.toolforge.org/quip/AU7VU7Zh6snAnmqnK_td [12:57:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:57:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:58:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org [12:59:38] (03PS1) 10Blake: admin: add Blake's backup SSH key. [puppet] - 10https://gerrit.wikimedia.org/r/1270436 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1300). [13:00:05] aude, Sergi0, MichaelG_WMF, James_F, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:25] Lucas_WMDE: 9/ [13:00:28] o/ [13:00:28] !log installing libnginx-mod-http-lua security updates [13:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:32] * MichaelG_WMF is here [13:00:35] I can deploy [13:00:40] hi [13:00:58] I think aude’s p-personal backport sounds like the most important one, so let’s start the gate-and-submit for that [13:01:02] my patches can go in any order [13:01:04] and then during those 15 minutes deploy config changes [13:01:13] sounds good [13:01:18] (I started looking at anzx’ config change but I’m not done with them yet) [13:01:21] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11814156 (10Ladsgroup) Thanks. I asked around to see if anyone would be willing to take a look. 
[13:01:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host phab1006.eqiad.wmnet with OS trixie [13:01:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [13:01:32] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814167 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host phab1006.eqiad.wmnet with OS trixie [13:01:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [13:02:53] (03Merged) 10jenkins-bot: Opt-in new accounts to ReadingLists beta feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [13:03:21] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P90500 and previous config saved to /var/cache/conftool/dbconfig/20260413-130320-fceratto.json [13:03:47] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270016|Opt-in new accounts to ReadingLists beta feature on pilot wikis (T422833)]] [13:03:50] T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833 [13:04:13] two of my changes cannot be positively tested (they only affect metrics/stats that will begin collection after these backports are done), but the third one, GrowthSuggestionToneCheck: flag as non-experimental, will be testable. 
[13:04:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:24] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Bit confusing that this language seems to put the word for “discussion” at the *front* of the talk namespace name (which means that the wo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx) [13:06:59] Isn't RTL fun? [13:07:05] MichaelG_WMF: should they all be deployed together then? [13:07:13] James_F: !sey [13:07:22] Lucas_WMDE: Doesn't hurt [13:08:00] extra fun when I’m looking at MessagesUr.php in emacs and have no idea if Emacs, tmux, and/or GNOME Terminal are responsible for displaying the RTLness correctly, and if they’re all pals with each other about it or not [13:08:19] Or if two of them both are broken and cancel each other out? [13:08:25] (: [13:09:41] (03CR) 10CI reject: [V:04-1] Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [13:10:01] 00:01:32.263 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/UniversalLanguageSelector/': GnuTLS recv error (-54): Error in the pull function.' [13:10:03] Sigh. [13:10:24] (03CR) 10Jforrester: [C:03+2] "C'mon, CI, we believe in you." [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [13:11:54] I love T421827 [13:11:54] T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827 [13:12:05] Indeed. 
[13:12:27] meanwhile https://spiderpig.wikimedia.org/jobs/1735 has been building image for quite some time 🤔 [13:12:29] * Lucas_WMDE looks [13:13:01] seems like the docker pushes are just taking some time [13:13:02] First deploy of the week, so new base SRE image? [13:13:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:30] https://sal.toolforge.org/production?p=0&q=scap&d= at least. [13:13:32] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1006.eqiad.wmnet with reason: host reimage [13:13:50] good point [13:13:53] 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11814190 (10KylieTastic) I have just had the same thing happen at https://en.wikipedia.org/wiki/File:Genie_immediately_after_rescue.jpg [13:14:09] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P90501 and previous config saved to /var/cache/conftool/dbconfig/20260413-131408-fceratto.json [13:14:31] I feel like we might be pushing more images too? (but I don’t have an older log to compare) [13:14:40] `grep docker-pusher /var/lib/spiderpig/scap-image-build-and-push-log` shows five images being pushed [13:14:50] webserver, singleversion, multiversion, singleversion-debug, singleversion-cli [13:15:30] Oh, are the singleversion ones new? 
[13:15:56] nothing super new in https://gitlab.wikimedia.org/repos/releng/release/-/commits/main/make-container-image/build-images.py and https://gitlab.wikimedia.org/repos/releng/scap/-/commits/master/scap/config.py though, maybe I’m wrong [13:16:04] (via codesearch for “singleversion”) [13:16:28] Nope, https://gitlab.wikimedia.org/repos/releng/scap/-/commit/c3080ce4a87513720b0c5720b53e2dc7f2b3b47e was 7 months ago [13:17:40] (03PS1) 10Muehlenhoff: Temporarily depool puppetserver1002/2002 [dns] - 10https://gerrit.wikimedia.org/r/1270441 [13:17:42] it finished building (after also pushing multiversion-debug and multiversion-cli) [13:17:57] Finally! [13:18:24] sergi0: how risky is your change, more or less? [13:18:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1006.eqiad.wmnet with reason: host reimage [13:19:12] Lucas_WMDE: worst produces another validation error instead of fixing the existing. Low risk I'd say [13:19:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bookworm [13:19:49] ok, then we can probably combine it with another config change or two [13:19:55] sgtm [13:19:55] aude: is your config change testable btw? 
[13:20:04] yes spot checking [13:20:04] (I suspect “not without registering a new account”) [13:20:07] ok [13:20:10] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host an-redacteddb1001 [13:20:11] should be there soon [13:20:13] i can make a test account [13:20:53] (03PS1) 10Kevin Bazira: istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) [13:21:09] (03Merged) 10jenkins-bot: Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude) [13:21:16] yay, backport made it through in the meantime [13:21:26] !log lucaswerkmeister-wmde@deploy1003 aude, lucaswerkmeister-wmde: Backport for [[gerrit:1270016|Opt-in new accounts to ReadingLists beta feature on pilot wikis (T422833)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:21:29] T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833 [13:21:30] so that’s up next, then probably all the other config changes together, then MichaelG_WMF’s backports [13:21:33] aude: please test :) [13:21:53] (I’m judging James_F’s config change to be low risk as well) [13:22:46] (03CR) 10Dpogorzelski: [C:03+1] istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [13:23:12] btullis@cumin1003 reimage (PID 2391205) is awaiting input [13:23:52] doesn't seem to opt in new accounts to the beta feature yet, but maybe i have to wait a bit or can do a follow up config change later. 
[13:24:01] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1019.eqiad.wmnet with OS trixie [13:24:02] otherwise, everything is okay [13:24:03] hm [13:24:12] like feel free to proceed [13:24:13] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: - clo... [13:24:19] ok [13:24:22] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:24:23] !log lucaswerkmeister-wmde@deploy1003 aude, lucaswerkmeister-wmde: Continuing with sync [13:24:25] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates [13:24:25] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [13:24:27] but it worked on testwiki? [13:24:31] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [13:24:31] yes [13:24:31] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates [13:24:34] I’% just looking at the timestamp again… looks correct to me [13:24:36] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie [13:24:49] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie [13:24:58] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P90502 and previous config saved to /var/cache/conftool/dbconfig/20260413-132457-fceratto.json [13:25:01] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [13:25:03] (*I’m) [13:25:17] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2221.codfw.wmnet with reason: Maintenance [13:25:42] maybe some part of the account signup flow didn’t have X-Wikimedia-Debug applied 🤔 [13:25:43] no idea [13:25:49] anyway, you can debug that later at your leisure ^^ [13:26:05] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90503 and previous config saved to /var/cache/conftool/dbconfig/20260413-132604-fceratto.json [13:26:13] ah maybe that's it [13:27:19] or it mw-debug.eqiad.pinkunicorn-6956bb54cc-lskrx and it'sworking [13:28:27] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1270436 (owner: 10Blake) [13:29:33] (03CR) 10Kamila Součková: [C:03+1] "My bad, I needed lunch '^^ LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah) [13:29:53] (03CR) 10Kevin Bazira: [C:03+2] istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [13:30:03] btullis@cumin1003 reimage (PID 2391205) is awaiting input [13:31:28] (03CR) 10Bking: [C:03+2] nginx tls proxy: remove defunct directive [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [13:35:14] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host an-redacteddb1001 - btullis@cumin1003" [13:35:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host an-redacteddb1001 - btullis@cumin1003" 
[13:35:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:35:19] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-redacteddb1001.eqiad.wmnet 18.48.64.10.in-addr.arpa 8.1.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:35:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-redacteddb1001.eqiad.wmnet 18.48.64.10.in-addr.arpa 8.1.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:35:23] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-redacteddb1001 [13:35:38] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814289 (10Jclark-ctr) Found a fried circuit on the board. Replaced the board and moved the CPUs over since the new ones did not match. The fault still continued on the ne... [13:35:59] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Search [13:36:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-redacteddb1001 [13:36:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host an-redacteddb1001 [13:36:45] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1012 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Search [13:37:02] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:37:38] (03Merged) 10jenkins-bot: istio-proxy: fix Lua script in EnvoyFilter to 
correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [13:37:56] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270016|Opt-in new accounts to ReadingLists beta feature on pilot wikis (T422833)]] (duration: 34m 09s) [13:38:00] T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833 [13:38:39] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270043|Re-add p-personal id to the user menu (T422885)]] [13:38:42] T422885: #p-personal disappeared - https://phabricator.wikimedia.org/T422885 [13:40:08] jclark@cumin1003 reimage (PID 2361809) is awaiting input [13:40:42] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90504 and previous config saved to /var/cache/conftool/dbconfig/20260413-134041-fceratto.json [13:40:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:41:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:41:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab1006.eqiad.wmnet with OS trixie [13:41:39] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814316 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host phab1006.eqiad.wmnet with OS trixie completed: - phab1006 (**PASS**)... 
[13:41:58] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814319 (10Jclark-ctr) [13:42:07] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814320 (10Jclark-ctr) 05Open→03Resolved [13:42:17] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude: Backport for [[gerrit:1270043|Re-add p-personal id to the user menu (T422885)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:25] checking [13:42:32] thanks [13:42:55] looks good [13:42:59] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude: Continuing with sync [13:42:59] yay [13:43:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2070.codfw.wmnet with OS bullseye [13:43:43] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11814327 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye [13:44:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2070 [13:44:34] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [13:45:00] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270447 [13:49:20] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270043|Re-add p-personal id to the user menu (T422885)]] (duration: 10m 41s) [13:49:24] T422885: #p-personal disappeared - https://phabricator.wikimedia.org/T422885 [13:49:33] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2070 - mvernon@cumin2002" [13:49:39] !log 
mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2070 - mvernon@cumin2002" [13:49:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:49:39] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2070.codfw.wmnet 86.0.192.10.in-addr.arpa 6.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:49:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno) [13:49:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 (owner: 10Jforrester) [13:49:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx) [13:49:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2070.codfw.wmnet 86.0.192.10.in-addr.arpa 6.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:49:45] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2070 [13:50:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2070 [13:50:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2070 [13:51:10] (03Merged) 10jenkins-bot: EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno) [13:51:31] 
!log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P90505 and previous config saved to /var/cache/conftool/dbconfig/20260413-135129-fceratto.json [13:51:31] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270448 [13:51:36] (03Merged) 10jenkins-bot: [abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 (owner: 10Jforrester) [13:51:45] Yay. [13:51:51] (03Merged) 10jenkins-bot: urwikisource: add مصنف (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx) [13:52:08] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1268965|EventStreamConfig: remove unused contextual attributes causing problems (T422001)]], [[gerrit:1270388|[abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s]], [[gerrit:1269788|urwikisource: add مصنف (author) namespace (T422824)]] [13:52:12] T422001: '.performer.active_browsing_session_token' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T422001 [13:52:13] T422824: Add author namespace to urwikisource - https://phabricator.wikimedia.org/T422824 [13:52:20] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage [13:52:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [13:52:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 
(https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [13:53:02] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [13:53:07] !log installing postgresql-common bugfix updates [13:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:13] 🤞 [13:53:48] !log lucaswerkmeister-wmde@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, jforrester: Backport for [[gerrit:1268965|EventStreamConfig: remove unused contextual attributes causing problems (T422001)]], [[gerrit:1270388|[abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s]], [[gerrit:1269788|urwikisource: add مصنف (author) namespace (T422824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:53:58] looking [13:54:02] * sergi0 checking [13:54:12] James_F: please also test ^^ [13:54:16] Testing. [13:54:41] Lucas_WMDE: looks good to sync [13:54:51] ack [13:54:52] Looks good from my end. [13:55:36] > stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/GeoData/': GnuTLS recv error (-54): Error in the pull function.' [13:55:41] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:55:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:55:51] love to see it [13:55:53] (one of the changes that just got a +2 failed on a git error) [13:56:34] Lucas_WMDE: lgtm [13:56:36] (03Merged) 10jenkins-bot: Record TOR account creation failure separately [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [13:56:38] (03CR) 10CI reject: [V:04-1] stats: add counters for experiment account creation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [13:56:42] !log lucaswerkmeister-wmde@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, jforrester: Continuing with sync [13:56:44] thanks! [13:57:07] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "try again (T421827)" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [13:58:53] (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [13:59:22] Lucas_WMDE: please run namespacedupes for urwikisource once sync is finished [13:59:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage [14:00:31] right, thanks for the reminder [14:00:38] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268965|EventStreamConfig: remove unused contextual attributes causing problems (T422001)]], [[gerrit:1270388|[abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s]], [[gerrit:1269788|urwikisource: add مصنف (author) namespace (T422824)]] (duration: 08m 30s) [14:00:41] (03CR) 10Ladsgroup: [C:03+1] "PCC seems to be noop in practice: 
https://puppet-compiler.wmflabs.org/output/1270432/6373/db1151.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [14:00:43] T422001: '.performer.active_browsing_session_token' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T422001 [14:00:43] T422824: Add author namespace to urwikisource - https://phabricator.wikimedia.org/T422824 [14:00:47] (backports need a few more minutes in CI anyway) [14:01:09] (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [14:01:40] 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11814472 (10KylieTastic) I have also just noticed that files where I deleted old revisions, such as https://en.wikipedia.org/wiki/File:SuccessKid.jpg, the old versions do not show "No thumbnail" but... 
[14:01:43] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes urwikisource --fix # T422824 [14:02:19] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P90506 and previous config saved to /var/cache/conftool/dbconfig/20260413-140218-fceratto.json [14:02:23] anzx: done [14:02:33] Lucas_WMDE: thanks for deploying [14:02:46] jouncebot: nowandnext [14:02:46] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [14:02:46] In 0 hour(s) and 27 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1430) [14:02:51] we’re technically past the end of the window but let’s still do the backports [14:03:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [14:03:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [14:04:23] (03PS1) 10Brouberol: growthbook: release unofficial build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270450 (https://phabricator.wikimedia.org/T420781) [14:04:57] thanks! [14:07:00] (03Merged) 10jenkins-bot: GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [14:07:43] (03CR) 10Marostegui: "You can create one and then protect it as security issue and then it can be tested with that one." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto) [14:08:40] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:09:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [14:09:57] (03CR) 10JMeybohm: rest-gateway: Add liftwing listeners and network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:10:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:11:21] (03Merged) 10jenkins-bot: stats: add counters for experiment account creation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [14:11:38] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270384|Record TOR account creation failure separately (T422283)]], [[gerrit:1270383|stats: add counters for experiment account creation (T422283)]], [[gerrit:1269495|GrowthSuggestionToneCheck: flag as non-experimental (T422835)]] [14:11:44] T422283: [V1 experiment changes] Enable reliable measurement of account creation for mobile registration experiment on auth.wikimedia.org domain and support broader rollout - https://phabricator.wikimedia.org/T422283 [14:11:44] T422835: Revise Tone tasks are warning users with "Experimental edit check. For testing purposes only." 
warning - https://phabricator.wikimedia.org/T422835 [14:12:26] at last [14:12:37] [ian mcdiarmid voice] [14:12:39] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1011: T419961 [14:12:39] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [14:13:07] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90507 and previous config saved to /var/cache/conftool/dbconfig/20260413-141306-fceratto.json [14:13:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:13:16] !log lucaswerkmeister-wmde@deploy1003 migr, lucaswerkmeister-wmde, urbanecm: Backport for [[gerrit:1270384|Record TOR account creation failure separately (T422283)]], [[gerrit:1270383|stats: add counters for experiment account creation (T422283)]], [[gerrit:1269495|GrowthSuggestionToneCheck: flag as non-experimental (T422835)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:13:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:13:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1011: T419961 [14:13:27] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Maintenance [14:13:48] MichaelG_WMF: please test! 
[14:13:49] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: T419961 [14:13:49] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [14:14:06] will test [14:14:11] !log bking@apt1002 sudo -E reprepro --ignore=wrongdistribution -C component/opensearch2 include trixie-wikimedia ~/opensearch-madvise-0.2/opensearch-madvise_0.2_amd64.changes T422860 [14:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:14] T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860 [14:14:15] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P90509 and previous config saved to /var/cache/conftool/dbconfig/20260413-141414-fceratto.json [14:14:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:14:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: T419961 [14:14:44] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1012: Security updates [14:14:44] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [14:14:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage [14:14:51] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:14:51] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1012: Security updates [14:15:02] Lucas_WMDE: The experimental warning is no longer there! 
Good to move forward 👍 [14:15:12] (03PS1) 10Marostegui: db2224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270451 (https://phabricator.wikimedia.org/T422777) [14:15:19] jclark@cumin1003 reimage (PID 2449143) is awaiting input [14:15:29] (03CR) 10Bearloga: [C:03+1] growthbook: release unofficial build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270450 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol) [14:15:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2224.codfw.wmnet with reason: Reimage to Trixie [14:16:01] (03CR) 10Brouberol: [C:03+2] growthbook: release unofficial build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270450 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol) [14:16:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2224: Reimage [14:16:21] (03CR) 10Marostegui: [C:03+2] db2224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270451 (https://phabricator.wikimedia.org/T422777) (owner: 10Marostegui) [14:16:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2224: Reimage [14:17:08] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:18:15] !log lucaswerkmeister-wmde@deploy1003 migr, lucaswerkmeister-wmde, urbanecm: Continuing with sync [14:18:20] MichaelG_WMF: thanks! sorry, got distracted [14:18:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:19:01] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:19:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2224.codfw.wmnet with OS trixie [14:19:45] (03PS1) 10Majavah: P:rsyslog: Update Keystone unit file names [puppet] - 10https://gerrit.wikimedia.org/r/1270452 (https://phabricator.wikimedia.org/T421911) [14:19:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:19:58] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:20:13] PROBLEM - MariaDB Replica IO: pc2 on pc2012 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1012.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1012.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:20:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-redacteddb1001.eqiad.wmnet with OS bookworm [14:22:00] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270384|Record TOR account creation failure separately (T422283)]], [[gerrit:1270383|stats: add counters for experiment account creation (T422283)]], [[gerrit:1269495|GrowthSuggestionToneCheck: flag as non-experimental (T422835)]] (duration: 10m 22s) [14:22:05] T422283: [V1 experiment changes] Enable reliable measurement of account creation for mobile registration experiment on auth.wikimedia.org domain and support broader rollout - https://phabricator.wikimedia.org/T422283 [14:22:06] T422835: Revise Tone tasks are warning users with "Experimental edit check. For testing purposes only." 
warning - https://phabricator.wikimedia.org/T422835 [14:22:22] !log UTC afternoon backport+config window done [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:16] jouncebot: next [14:23:16] In 0 hour(s) and 6 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1430) [14:23:17] (03PS1) 10Bearloga: EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) [14:23:55] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Allow to easily disable puppet-merges temporarily - https://phabricator.wikimedia.org/T423121#11814631 (10CDanis) Sounds a lot like {T248872} ? [14:25:42] (03CR) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:26:14] RECOVERY - MariaDB Replica IO: pc2 on pc2012 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:26:22] (03CR) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:27:46] 06SRE, 06Infrastructure-Foundations: Update debdeploy to use checkrestart instead of lsof to detect library restarts - https://phabricator.wikimedia.org/T422614#11814638 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:28:14] PROBLEM - MariaDB Replica Lag: pc2 on pc2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - 
https://phabricator.wikimedia.org/T422251#11814639 (10AnneT) @MoritzMuehlenhoff apologies; I was out last week. I've confirmed that I can now access my experiment data in superset - thanks very much! [14:28:52] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P90512 and previous config saved to /var/cache/conftool/dbconfig/20260413-142851-fceratto.json [14:28:56] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:29:35] (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1430) [14:32:39] (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [14:34:32] (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [14:34:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2070.codfw.wmnet with OS bullseye [14:34:46] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11814687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye compl... 
[14:35:26] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186#11814699 (10LSobanski) 05Open→03Declined [14:36:06] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2224.codfw.wmnet with reason: host reimage [14:37:30] (03PS2) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [14:38:13] (03CR) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [14:39:41] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P90513 and previous config saved to /var/cache/conftool/dbconfig/20260413-143939-fceratto.json [14:39:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS bullseye [14:40:05] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11814730 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye [14:40:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: host reimage [14:40:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2069 [14:41:00] (03PS1) 10FNegri: mariadb: wiki-replicas: remove redundant grants [puppet] - 10https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806) [14:41:02] (03PS1) 10FNegri: mariadb: wiki-replicas: add grants for %_maintain [puppet] - 10https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806) [14:43:09] 
(03PS1) 10Marostegui: Revert "db2224: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270467 [14:43:29] mvernon@cumin2002 reimage (PID 2300741) is awaiting input [14:43:33] 06SRE, 06Infrastructure-Foundations: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610#11814753 (10LSobanski) 05Open→03Declined The effort of moving accounts to new UIDs is too high. [14:44:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Rake tasks: add colours and buffer output - https://phabricator.wikimedia.org/T237508#11814778 (10jhathaway) 05Open→03Declined I don't think buffering the output is always wanted, as you may want to see the first error as quickly as possible, so dec... [14:44:58] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [14:46:19] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609#11814782 (10LSobanski) 05Open→03Resolved a:03LSobanski Should have been addressed with upgrade to Puppet 7, please reopen if you st... [14:47:06] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: puppet-merge shouldn't fail if `tput` doesn't grok your terminal - https://phabricator.wikimedia.org/T221985#11814801 (10LSobanski) 05Open→03Declined [14:48:27] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577#11814809 (10LSobanski) 05Open→03Declined Shouldn't be a problem with today's infrastructure. 
[14:48:50] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2069 - mvernon@cumin2002" [14:48:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2069 - mvernon@cumin2002" [14:48:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:56] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2069.codfw.wmnet 181.48.192.10.in-addr.arpa 1.8.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2069.codfw.wmnet 181.48.192.10.in-addr.arpa 1.8.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:01] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2069 [14:49:03] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230#11814825 (10LSobanski) p:05Medium→03Low [14:50:14] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: more verbose hiera messages on failures - https://phabricator.wikimedia.org/T109692#11814828 (10LSobanski) 05Open→03Declined Closing, please reopen if still a problem on the current Puppet version. 
[14:50:29] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P90514 and previous config saved to /var/cache/conftool/dbconfig/20260413-145028-fceratto.json [14:50:52] 06SRE, 06Infrastructure-Foundations: reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812#11814830 (10MoritzMuehlenhoff) p:05Medium→03Low [14:50:57] (03PS1) 10Kamila Součková: Revert "Enable $wgTempCategoryCollations for testwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270470 (https://phabricator.wikimedia.org/T422546) [14:51:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2069 [14:51:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2069 [14:52:14] RECOVERY - MariaDB Replica Lag: pc2 on pc2012 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:52:19] (03PS1) 10Kamila Součková: Revert "Temporarily add shellbox-icu to $wgShellboxUrls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270472 (https://phabricator.wikimedia.org/T422546) [14:53:12] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2012: T419961 [14:53:12] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [14:53:25] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:53:25] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2012: T419961 [14:53:33] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1012: T419961 [14:53:34] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [14:53:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:53:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1012: T419961 [14:54:15] (03Abandoned) 
10Kamila Součková: Temporarily add shellbox-icu ClusterIP endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [14:54:36] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie [14:54:51] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie [14:54:54] (03CR) 10Ladsgroup: "Do we have "client" wikis?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [14:55:50] FIRING: [14x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:10] jouncebot: nowandnext [15:00:10] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [15:00:10] In 0 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1530) [15:00:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T410589)', diff saved to https://phabricator.wikimedia.org/P90516 and previous config saved to /var/cache/conftool/dbconfig/20260413-150019-ladsgroup.json [15:00:24] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:00:37] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1166: repool after maintenance [15:00:45] (03PS1) 10Kevin Bazira: istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) [15:00:50] RESOLVED: [22x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:18] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P90518 and previous config saved to /var/cache/conftool/dbconfig/20260413-150116-fceratto.json [15:01:27] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:02:00] (03CR) 10Marostegui: [C:03+2] Revert "db2224: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270467 (owner: 10Marostegui) [15:03:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2224.codfw.wmnet with OS trixie [15:03:16] (03CR) 10Dpogorzelski: [C:03+1] istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [15:04:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:04:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2224: After Reimage [15:04:53] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2224: After Reimage [15:05:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2224: After Reimage [15:05:12] (03PS2) 10FNegri: mariadb: wiki-replicas: remove redundant grants [puppet] - 10https://gerrit.wikimedia.org/r/1270464 
(https://phabricator.wikimedia.org/T422806) [15:05:12] (03PS2) 10FNegri: mariadb: wiki-replicas: add grants for %_maintain [puppet] - 10https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806) [15:06:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:07:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:07:58] (03PS1) 10Marostegui: db1187.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270478 (https://phabricator.wikimedia.org/T422777) [15:08:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1187: Upgrade package [15:08:43] (03CR) 10Marostegui: [C:03+2] db1187.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270478 (https://phabricator.wikimedia.org/T422777) (owner: 10Marostegui) [15:08:44] (03CR) 10Kevin Bazira: [C:03+2] istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [15:08:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1187.eqiad.wmnet with reason: Reimage to Trixie [15:08:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1187: Upgrade package [15:09:11] (03PS10) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [15:09:33] (03CR) 10Elukey: tox: rework venvs to speed up local and CI timings (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 
(https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [15:09:54] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1013: Security updates [15:09:55] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [15:10:02] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:10:02] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1013: Security updates [15:10:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P90522 and previous config saved to /var/cache/conftool/dbconfig/20260413-151027-ladsgroup.json [15:10:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS trixie [15:11:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:12:52] (03PS1) 10Hnowlan: prometheus, thanos: move recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663) [15:13:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:14:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:15:03] 06SRE, 10Lift-Wing, 06Machine-Learning-Team: Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11815008 (10DPogorzelski-WMF) [15:16:14] PROBLEM - MariaDB Replica IO: pc3 on pc2013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1013.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1013.eqiad.wmnet (111 Connection refused) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:16:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:16:56] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:17:00] (03CR) 10Brouberol: [C:03+2] Allow WMDE Airflow instance to egress to dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269978 (https://phabricator.wikimedia.org/T414583) (owner: 10Andrew McAllister (WMDE)) [15:17:11] (03Merged) 10jenkins-bot: istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira) [15:20:14] RECOVERY - MariaDB Replica IO: pc3 on pc2013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:20:15] (03PS1) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) [15:20:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P90526 and previous config saved to /var/cache/conftool/dbconfig/20260413-152034-ladsgroup.json [15:20:48] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:20:51] !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[15:21:01] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:21:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:21:14] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:21:17] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:21:19] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:21:28] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:21:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:21:52] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:21:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:21:59] Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ... 
[15:21:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:22:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:25:14] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:25:41] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [15:26:12] (PS1) Muehlenhoff: Record LDAP access for passimacopoulos [puppet] - https://gerrit.wikimedia.org/r/1270483 [15:26:28] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:27:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:27:31] (CR) Muehlenhoff: [C:+2] Record LDAP access for passimacopoulos [puppet] - https://gerrit.wikimedia.org/r/1270483 (owner: Muehlenhoff) [15:27:37] (CR) Phuedx: [C:+1] EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: Bearloga) [15:28:09] ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815072 (Marostegui) Open→Resolved Thanks John for trying swapping many parts - unfortunately it didn't work so I am going to close this task and open a new...
[15:29:06] mvernon@cumin2002 reimage (PID 2300741) is awaiting input [15:29:10] ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815088 (Marostegui) [15:29:20] (CR) Lucas Werkmeister (WMDE): Enable and configure WikiProjects prototype on WikiData beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [15:29:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:29:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1530).
nyaa~ [15:30:15] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:30:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T410589)', diff saved to https://phabricator.wikimedia.org/P90527 and previous config saved to /var/cache/conftool/dbconfig/20260413-153042-ladsgroup.json [15:30:46] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:30:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [15:31:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T410589)', diff saved to https://phabricator.wikimedia.org/P90529 and previous config saved to /var/cache/conftool/dbconfig/20260413-153107-ladsgroup.json [15:31:24] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:31:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:31:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:32:01] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:32:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:33:17] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [15:35:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:36:52] !log installing postgresql-15 security updates [15:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:11] (03CR) 10Andrew Bogott: [C:03+2] "I suspect this is necessary but insufficient" [puppet] - 10https://gerrit.wikimedia.org/r/1270452 (https://phabricator.wikimedia.org/T421911) (owner: 10Majavah) [15:37:23] !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1013: Security updates [15:37:23] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [15:37:34] (03PS3) 10FNegri: mariadb: wiki-replicas: remove redundant grants [puppet] - 10https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806) [15:37:34] (03PS3) 10FNegri: mariadb: wiki-replicas: add grants for %_maintain [puppet] - 10https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806) [15:37:35] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:37:35] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1013: Security updates [15:39:18] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [15:40:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:41:17] 
RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:41:27] (03PS2) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) [15:42:31] (03CR) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven) [15:43:35] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:43:40] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:43:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [15:46:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1166: repool after maintenance [15:46:37] (03PS1) 10Marostegui: Revert "db1187.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270485 [15:48:15] (03PS1) 10CDanis: check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) [15:48:41] (03CR) 10CI reject: [V:04-1] check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) (owner: 10CDanis) [15:49:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [15:49:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T419635)', diff saved to 
https://phabricator.wikimedia.org/P90533 and previous config saved to /var/cache/conftool/dbconfig/20260413-154937-fceratto.json [15:49:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:50:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2224: After Reimage [15:50:40] (03PS2) 10CDanis: check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) [15:51:12] (03CR) 10CI reject: [V:04-1] check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) (owner: 10CDanis) [15:51:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1187.eqiad.wmnet with OS trixie [15:52:02] (03PS3) 10CDanis: check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) [15:52:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419635)', diff saved to https://phabricator.wikimedia.org/P90535 and previous config saved to /var/cache/conftool/dbconfig/20260413-155253-fceratto.json [15:53:05] (03CR) 10Elukey: amg-gpu: Set up explicit GPU partitioning (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [15:55:53] (03CR) 10Marostegui: [C:03+2] Revert "db1187.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1270485 (owner: 10Marostegui) [15:56:25] (03PS2) 10Elukey: istio: revisit Prometheus buckets for Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) [15:56:47] (03CR) 
Lucas Werkmeister (WMDE): "LGTM; should be deployed after I98225a2309 has been merged (but doesn’t need a Depends-On in the commit message, otherwise scap would refu" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [15:56:50] (CR) Lucas Werkmeister (WMDE): [C:+1] Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven) [15:56:51] (CR) Tiziano Fogli: [C:+1] smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: Cwhite) [15:58:41] (CR) Elukey: "It turns out that Wikikube emits ~700k time series every 5m, while the ML clusters one order of magnitude less." [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey) [15:59:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1187: After Reimage [16:02:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS bullseye [16:02:08] SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11815292 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye compl...
[16:02:46] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11815293 (MoritzMuehlenhoff) [16:03:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P90537 and previous config saved to /var/cache/conftool/dbconfig/20260413-160301-fceratto.json [16:07:50] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1014: Security updates [16:07:50] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [16:07:57] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:07:57] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1014: Security updates [16:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P90539 and previous config saved to /var/cache/conftool/dbconfig/20260413-161310-fceratto.json [16:13:20] jouncebot: nowandnext [16:13:20] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [16:13:20] In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700) [16:13:20] In 0 hour(s) and 46 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700) [16:14:19] PROBLEM - MariaDB Replica IO: pc4 on pc2014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1014.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on
pc1014.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1019.eqiad.wmnet with OS trixie [16:15:01] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: -... [16:17:11] (03PS1) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) [16:17:31] (03PS2) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) [16:18:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: 10Daimona Eaytoy) [16:19:20] RECOVERY - MariaDB Replica IO: pc4 on pc2014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:22:18] (03CR) 10Federico Ceratto: [C:03+1] sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [16:22:26] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [16:23:08] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.clone: record clone runs 
into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [16:23:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419635)', diff saved to https://phabricator.wikimedia.org/P90541 and previous config saved to /var/cache/conftool/dbconfig/20260413-162318-fceratto.json [16:23:22] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:23:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [16:23:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T419635)', diff saved to https://phabricator.wikimedia.org/P90542 and previous config saved to /var/cache/conftool/dbconfig/20260413-162344-fceratto.json [16:26:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419635)', diff saved to https://phabricator.wikimedia.org/P90543 and previous config saved to /var/cache/conftool/dbconfig/20260413-162657-fceratto.json [16:28:17] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1273.eqiad.wmnet [16:28:54] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1273.eqiad.wmnet [16:30:16] 10ops-codfw, 06DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159 (10Jhancock.wm) 03NEW [16:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:20] !log banning non-standard thumbs with external referrer regardless of cache status (T414805) [16:35:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:23] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [16:35:53] !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1014: Security updates [16:35:53] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [16:36:07] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:36:07] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1014: Security updates [16:37:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P90546 and previous config saved to /var/cache/conftool/dbconfig/20260413-163706-fceratto.json [16:37:56] (03PS9) 10Eevans: aqs1025: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [16:37:56] (03PS9) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [16:37:56] (03PS9) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [16:40:01] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:40:46] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [16:41:15] (03PS2) 10Ladsgroup: envoy: Close connections to swift after 10s of inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872) [16:41:21] (03CR) 10Ladsgroup: [V:03+2 C:03+2] envoy: Close connections to swift after 10s of inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [16:44:05] !log contint2002 (prod CI) - re-enabled puppet - this applied a refresh of the contint.wikimedia.org certificate 
[16:44:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[7002-7008].magru.wmnet} and A:cp - 9.2.13 Upgrade () [16:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1187: After Reimage [16:44:45] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[7010-7016].magru.wmnet} and A:cp - 9.2.13 Upgrade () [16:46:09] !log contint2002 (prod CI) - re-enabled puppet - this applied a refresh of the contint.wikimedia.org certificate (T423152 T420993) [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:14] T423152: PuppetDisabled - contint2002 - https://phabricator.wikimedia.org/T423152 [16:46:15] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [16:46:27] ops-codfw, collaboration-services, DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11815664 (Ladsgroup) Hi, that's for sre-collab team since they own mailman now!
[16:47:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P90548 and previous config saved to /var/cache/conftool/dbconfig/20260413-164713-fceratto.json [16:47:25] (PS1) CDanis: aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) [16:50:38] (CR) BCornwall: [C:+1] Temporarily depool puppetserver1002/2002 [dns] - https://gerrit.wikimedia.org/r/1270441 (owner: Muehlenhoff) [16:50:51] (CR) BCornwall: [C:+1] wikimedia.org: Restore original TTL for dumps [dns] - https://gerrit.wikimedia.org/r/1270363 (https://phabricator.wikimedia.org/T422040) (owner: Majavah) [16:51:44] SRE: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168 (Scott_French) NEW [16:52:47] SRE: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815713 (Scott_French) I've verified that manually deleting an editor-analytics pod in staging will trigger crash looping, and then setting initialDelaySeconds on the liveness probe (in t...
[16:53:28] (PS1) Scott French: aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) [16:53:53] SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815719 (Scott_French) p:Triage→High [16:55:02] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: Ladsgroup) [16:55:19] (CR) Eevans: [C:+1] aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) (owner: Scott French) [16:56:35] (Merged) jenkins-bot: ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: Ladsgroup) [16:56:44] (CR) JHathaway: [C:+2] provision: Workaround Supermicro BIOS to UEFI bug (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1262196 (https://phabricator.wikimedia.org/T393053) (owner: JHathaway) [16:56:48] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] [16:56:52] T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts - https://phabricator.wikimedia.org/T421729 [16:57:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419635)', diff saved to https://phabricator.wikimedia.org/P90549 and previous config saved to /var/cache/conftool/dbconfig/20260413-165721-fceratto.json [16:57:25] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:57:36] SRE, Patch-For-Review: aqs-http-gateway services at risk 
from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815745 (Scott_French) [16:57:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [16:57:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T419635)', diff saved to https://phabricator.wikimedia.org/P90550 and previous config saved to /var/cache/conftool/dbconfig/20260413-165747-fceratto.json [16:58:24] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:58:43] (CR) Scott French: [C:+2] aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) (owner: Scott French) [16:59:22] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700) [17:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700).
[17:00:27] o/ [17:00:44] I'll be deploying to a handful of non-MediaWiki services during this window [17:00:45] (PS1) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 [17:01:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419635)', diff saved to https://phabricator.wikimedia.org/P90551 and previous config saved to /var/cache/conftool/dbconfig/20260413-170059-fceratto.json [17:01:41] (Merged) jenkins-bot: aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) (owner: Scott French) [17:03:09] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [17:03:31] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] (duration: 06m 43s) [17:03:36] T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts - https://phabricator.wikimedia.org/T421729 [17:06:13] SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815811 (Eevans) [17:06:41] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1017: Security updates [17:06:41] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [17:06:49] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [17:06:49] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1017: Security updates [17:07:04] ops-codfw, collaboration-services, DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11815840 (Jhancock.wm) a:Ladsgroup→None np!
[17:10:43] (CR) JMeybohm: [C:+1] aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) (owner: CDanis) [17:11:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P90553 and previous config saved to /var/cache/conftool/dbconfig/20260413-171107-fceratto.json [17:11:39] (PS2) CDanis: aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) [17:11:48] (CR) CDanis: [C:+2] aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) (owner: CDanis) [17:12:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:13:41] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:57] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [17:14:11] (Merged) jenkins-bot: aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) (owner: CDanis) [17:15:13] (PS1) Ladsgroup: Revert^6 "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 [17:17:45] jhancock@cumin2002 provision (PID 2451662) is awaiting input [17:18:56] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:19:16] (CR) CDanis: [C:+1] Revert^6 "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 (owner: Ladsgroup) [17:19:29] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:19:57] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [17:20:18] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [17:21:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P90554 and previous config saved to /var/cache/conftool/dbconfig/20260413-172115-fceratto.json [17:21:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:21:58] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 (owner: Ladsgroup) [17:22:57] (Merged) jenkins-bot: Revert^6 "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 (owner: Ladsgroup) [17:23:12] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1270500|Revert^6 "Use envoy for swift inside mediawiki"]] [17:24:49] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1270500|Revert^6 "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:26:24] SRE, Infrastructure-Foundations, netops, Incident Severity 3: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11815922 (MLechvien-WMF) [17:26:52] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[7010-7016].magru.wmnet} and A:cp - 9.2.13 Upgrade () [17:26:57] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:26:58] ops-codfw, DC-Ops: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175 (Jhancock.wm) NEW [17:27:45] ops-codfw, DC-Ops, ServiceOps new: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175#11815952 (Jhancock.wm) [17:29:25] SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815965 (Scott_French) Plot twist: Deploying https://gerrit.wikimedia.org/r/1270496 to editor-analytics staging failed, again with a (initial) liveness check timeou... [17:29:48] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[7002-7008].magru.wmnet} and A:cp - 9.2.13 Upgrade () [17:30:43] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270500|Revert^6 "Use envoy for swift inside mediawiki"]] (duration: 07m 31s) [17:31:19] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:31:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419635)', diff saved to https://phabricator.wikimedia.org/P90555 and previous config saved to /var/cache/conftool/dbconfig/20260413-173123-fceratto.json [17:31:27] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:31:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [17:31:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T419635)', diff saved to https://phabricator.wikimedia.org/P90556 and previous config saved to /var/cache/conftool/dbconfig/20260413-173148-fceratto.json [17:32:12] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:33:12] !log dropping templatelinks and pagelinks on testcommonswiki core db (T421914) [17:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:15] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [17:33:33] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:33:35] !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1017: Security updates [17:33:35] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [17:33:49] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [17:33:49] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1017: Security updates [17:34:20] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:35:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419635)', diff saved to https://phabricator.wikimedia.org/P90558 and previous config saved to /var/cache/conftool/dbconfig/20260413-173501-fceratto.json [17:35:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:35:37] (03CR) 10Jdlrobson: "recheck" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [17:36:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:37:06] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:39:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:40:19] !log applied latent external-services network policy changes for aqs{1023,1024} - T423168 [17:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:22] T423168: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168 [17:40:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:41:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[17:41:44] Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ... [17:41:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:41:51] \o/ [17:42:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:42:53] (03CR) 10Michael Große: "We can probably abandon this. It is for -wmf.22, and -wmf.23 has already been rolled out to all wikis last week." [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [17:44:40] (03PS2) 10Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 [17:44:55] 10ops-codfw, 06DC-Ops, 06ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177 (10Jhancock.wm) 03NEW [17:45:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P90559 and previous config saved to /var/cache/conftool/dbconfig/20260413-174509-fceratto.json [17:45:10] (03CR) 10CI reject: [V:04-1] ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 (owner: 10Andrew Bogott) [17:45:41] 06SRE: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11816149 (10Scott_French) So, once the external-services network policy changes were applied, the crash-looping pod in editor-analytics was able to start successfully. That means the `i/o t... 
[17:46:13] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [17:46:31] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [17:47:06] (03PS3) 10Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 [17:47:35] (03CR) 10CI reject: [V:04-1] ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 (owner: 10Andrew Bogott) [17:49:49] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11816171 (10MLechvien-WMF) [17:51:13] 06SRE: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816179 (10Scott_French) [17:51:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270497 (owner: 10Andrew Bogott) [17:52:37] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_ulsfo - 9.2.13 Upgrade () [17:52:51] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_ulsfo - 9.2.13 Upgrade () [17:53:21] (03PS4) 10Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 [17:55:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P90560 and previous config saved to /var/cache/conftool/dbconfig/20260413-175517-fceratto.json [17:55:47] (03CR) 10CI reject: [V:04-1] ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 (owner: 10Andrew Bogott) [18:01:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers 
wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:02:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:02:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:03:00] (03PS5) 10Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - 10https://gerrit.wikimedia.org/r/1270497 [18:03:35] 10ops-codfw, 06DC-Ops: sretest2001 has broken psu - https://phabricator.wikimedia.org/T423179 (10Jhancock.wm) 03NEW [18:04:04] !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1018: Security updates [18:04:04] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [18:04:12] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [18:04:12] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1018: Security updates [18:05:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:05:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419635)', diff saved to https://phabricator.wikimedia.org/P90562 and previous config saved to /var/cache/conftool/dbconfig/20260413-180525-fceratto.json [18:05:26] (03PS1) 10Bking: opensearch: hack around upstream 2.x+ packages [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) [18:05:29] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:05:34] (03CR) 10CI reject: [V:04-1] ceph/radosgw: enable static website creation [puppet] - 
10https://gerrit.wikimedia.org/r/1270497 (owner: 10Andrew Bogott) [18:05:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [18:05:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T419635)', diff saved to https://phabricator.wikimedia.org/P90563 and previous config saved to /var/cache/conftool/dbconfig/20260413-180551-fceratto.json [18:05:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:05:59] (03CR) 10CI reject: [V:04-1] opensearch: hack around upstream 2.x+ packages [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:06:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1269649 (owner: 10Dzahn) [18:06:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:07:17] (03PS2) 10Bking: opensearch: hack around upstream 2.x+ packages [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) [18:07:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:07:46] (03CR) 10CI reject: [V:04-1] opensearch: hack around upstream 2.x+ packages [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:09:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419635)', diff saved to https://phabricator.wikimedia.org/P90564 and previous config saved to /var/cache/conftool/dbconfig/20260413-180902-fceratto.json [18:09:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers 
wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:10:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:10:39] (03PS3) 10Bking: opensearch: hack around upstream 2.x+ packages [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) [18:11:11] 10ops-codfw, 06DC-Ops: sretest2001 has broken psu - https://phabricator.wikimedia.org/T423179#11816305 (10Jhancock.wm) [18:11:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [18:16:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:16:49] (03CR) 10Zabe: "The point is that currently testcommons sees itself as a client of commons since we set nothing else. But we set x1 as a virtual domain. 
I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [18:16:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270497 (owner: 10Andrew Bogott) [18:17:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:19:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P90565 and previous config saved to /var/cache/conftool/dbconfig/20260413-181911-fceratto.json [18:19:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:20:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:22:23] 10ops-codfw, 10Data-Persistence-Misc, 06DC-Ops: db2201 broken DIMM - https://phabricator.wikimedia.org/T423184 (10Jhancock.wm) 03NEW [18:22:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:26:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:27:22] (03PS1) 10Zabe: Start reading from new file tables on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) [18:27:38] (03CR) 10Zabe: [C:04-2] Start reading from new 
file tables on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [18:27:54] jouncebot: nowandnext [18:27:55] No deployments scheduled for the next 1 hour(s) and 32 minute(s) [18:27:55] In 1 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2000) [18:28:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:29:07] (03CR) 10Zabe: [C:03+2] NewFilesPager: Make sure filerevision is queried before file [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270068 (https://phabricator.wikimedia.org/T422946) (owner: 10Zabe) [18:29:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P90566 and previous config saved to /var/cache/conftool/dbconfig/20260413-182919-fceratto.json [18:30:25] !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1018: Security updates [18:30:26] !log root@cumin1003 START - Cookbook sre.mysql.parsercache [18:30:39] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [18:30:39] !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1018: Security updates [18:33:07] (03PS1) 10Daniel Kinzler: rest gateway: handle percent-escaped pipes in query params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) [18:36:55] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_ulsfo - 9.2.13 Upgrade () [18:37:02] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_ulsfo - 9.2.13 Upgrade () [18:38:54] (03Merged) 10jenkins-bot: NewFilesPager: Make sure filerevision is queried 
before file [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270068 (https://phabricator.wikimedia.org/T422946) (owner: 10Zabe) [18:39:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419635)', diff saved to https://phabricator.wikimedia.org/P90568 and previous config saved to /var/cache/conftool/dbconfig/20260413-183927-fceratto.json [18:39:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:39:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [18:39:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T419635)', diff saved to https://phabricator.wikimedia.org/P90569 and previous config saved to /var/cache/conftool/dbconfig/20260413-183953-fceratto.json [18:40:21] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1270068|NewFilesPager: Make sure filerevision is queried before file (T422946)]] [18:40:25] T422946: Expectation (readQueryTime <= 5) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T422946 [18:41:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1270068|NewFilesPager: Make sure filerevision is queried before file (T422946)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:43:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419635)', diff saved to https://phabricator.wikimedia.org/P90570 and previous config saved to /var/cache/conftool/dbconfig/20260413-184305-fceratto.json [18:44:16] !log zabe@deploy1003 Sync cancelled. 
[18:45:07] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_drmrs - 9.2.13 Upgrade () [18:45:10] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_drmrs - 9.2.13 Upgrade () [18:46:46] (03CR) 10JHathaway: [C:03+2] nftables: cleanup tests [puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [18:47:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11816490 (10Jclark-ctr) @BTullis the drive has arrived when can it be replaced? [18:49:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11816505 (10Jclark-ctr) @klausman did you need us to order the drive? [18:51:10] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply [18:51:19] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [18:51:20] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [18:51:41] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [18:51:42] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [18:52:16] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [18:52:17] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply [18:52:32] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [18:52:34] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [18:53:01] is that a whiff of charlie I detect in the air? 
[18:53:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P90571 and previous config saved to /var/cache/conftool/dbconfig/20260413-185314-fceratto.json [18:53:24] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [18:53:25] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [18:53:44] rzl: ha, alas all shell loop [18:54:04] haha oh well [18:54:44] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [18:55:05] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [18:59:54] (PS1) Zabe: Revert "NewFilesPager: Make sure filerevision is queried before file" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270517 [19:00:07] (CR) Zabe: [V:+2 C:+2] Revert "NewFilesPager: Make sure filerevision is queried before file" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270517 (owner: Zabe) [19:00:49] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1270517|Revert "NewFilesPager: Make sure filerevision is queried before file"]] [19:02:26] !log zabe@deploy1003 zabe: Backport for [[gerrit:1270517|Revert "NewFilesPager: Make sure filerevision is queried before file"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:02:52] !log zabe@deploy1003 zabe: Continuing with sync [19:03:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P90572 and previous config saved to /var/cache/conftool/dbconfig/20260413-190322-fceratto.json [19:04:51] FYI, I'll be applying some pending diffs from https://gerrit.wikimedia.org/r/1270496 to the production equivalents of the staging services updated above ^ [19:05:00] (PS1) Arlolra: Deploy PRV to 4 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1270518 (https://phabricator.wikimedia.org/T423188) [19:06:40] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270517|Revert "NewFilesPager: Make sure filerevision is queried before file"]] (duration: 05m 51s) [19:07:27] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [19:07:52] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [19:08:23] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [19:08:46] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [19:09:17] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [19:09:38] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [19:10:10] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [19:10:28] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [19:10:59] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [19:11:20] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [19:11:42] (PS6) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 [19:11:51] !log swfrench@deploy1003 helmfile [eqiad] START 
helmfile.d/services/page-analytics: apply [19:12:08] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [19:12:27] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott) [19:13:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419635)', diff saved to https://phabricator.wikimedia.org/P90573 and previous config saved to /var/cache/conftool/dbconfig/20260413-191330-fceratto.json [19:13:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [19:13:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [19:13:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90574 and previous config saved to /var/cache/conftool/dbconfig/20260413-191355-fceratto.json [19:14:08] (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott) [19:14:52] !log applied aqs cassandra host list changes from https://gerrit.wikimedia.org/r/1270496 - T423168 [19:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:56] T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168 [19:17:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90575 and previous config saved to /var/cache/conftool/dbconfig/20260413-191707-fceratto.json [19:19:34] (PS7) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 [19:19:42] (CR) Andrew Bogott: "check experimental" [puppet] - 
https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott) [19:22:05] (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott) [19:24:07] SRE: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816600 (Scott_French) [19:25:11] !log 💙cdanis@apt1002.wikimedia.org ~ 🕞🍵 sudo -i reprepro --component main --restrict cidergrinder update trixie-wikimedia [19:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P90576 and previous config saved to /var/cache/conftool/dbconfig/20260413-192715-fceratto.json [19:28:35] (PS8) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 [19:33:05] SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816623 (Scott_French) p:High→Medium @Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up? Basically, once the new host reaches t... 
[19:34:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:35:02] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1001.eqiad.wmnet [19:35:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:35:23] SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816626 (Scott_French) [19:35:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:36:24] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:36:59] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_drmrs - 9.2.13 Upgrade () [19:37:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P90577 and previous config saved to /var/cache/conftool/dbconfig/20260413-193726-fceratto.json [19:38:34] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott) [19:39:22] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:39:24] 
(CR) JHathaway: P:base: Make nftables::set resources always defined (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1266205 (owner: Majavah) [19:39:49] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_drmrs - 9.2.13 Upgrade () [19:41:22] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:42:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet [19:42:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1002.eqiad.wmnet [19:45:34] (PS1) Andrew Bogott: ceph config: remove defaults for some optional args [puppet] - https://gerrit.wikimedia.org/r/1270542 [19:45:44] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270542 (owner: Andrew Bogott) [19:45:49] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[3066,3068-3073].esams.wmnet} and A:cp - 9.2.13 Upgrade () [19:46:16] (CR) CI reject: [V:-1] ceph config: remove defaults for some optional args [puppet] - https://gerrit.wikimedia.org/r/1270542 (owner: Andrew Bogott) [19:46:50] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[3075-3081].esams.wmnet} and A:cp - 9.2.13 Upgrade () [19:47:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90578 and previous config saved to /var/cache/conftool/dbconfig/20260413-194734-fceratto.json [19:47:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [19:47:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [19:47:59] 
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T419635)', diff saved to https://phabricator.wikimedia.org/P90579 and previous config saved to /var/cache/conftool/dbconfig/20260413-194759-fceratto.json [19:49:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet [19:49:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1003.eqiad.wmnet [19:50:25] (Abandoned) Andrew Bogott: ceph config: remove defaults for some optional args [puppet] - https://gerrit.wikimedia.org/r/1270542 (owner: Andrew Bogott) [19:51:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419635)', diff saved to https://phabricator.wikimedia.org/P90580 and previous config saved to /var/cache/conftool/dbconfig/20260413-195113-fceratto.json [19:56:29] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet [19:58:04] (PS1) Ottomata: mw-page-html-content-change-enrich-next - try sync mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270548 (https://phabricator.wikimedia.org/T421216) [19:59:30] (PS2) Ottomata: mw-page-html-content-change-enrich-next - try sync mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270548 (https://phabricator.wikimedia.org/T421216) [20:00:00] (CR) Ottomata: [V:+2 C:+2] mw-page-html-content-change-enrich-next - try sync mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270548 (https://phabricator.wikimedia.org/T421216) (owner: Ottomata) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:01:02] !log andrewtavis-wmde@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [20:01:04] !log andrewtavis-wmde@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [20:01:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P90581 and previous config saved to /var/cache/conftool/dbconfig/20260413-200122-fceratto.json [20:02:25] ops-codfw, Data-Persistence-Misc, DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195 (Jhancock.wm) NEW [20:05:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [20:07:01] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:07:05] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:10:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [20:11:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P90582 and previous config saved to /var/cache/conftool/dbconfig/20260413-201130-fceratto.json [20:21:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419635)', diff saved to https://phabricator.wikimedia.org/P90583 and previous config saved to 
/var/cache/conftool/dbconfig/20260413-202137-fceratto.json [20:21:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [20:21:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2195.codfw.wmnet with reason: Maintenance [20:22:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T419635)', diff saved to https://phabricator.wikimedia.org/P90584 and previous config saved to /var/cache/conftool/dbconfig/20260413-202201-fceratto.json [20:22:53] SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816804 (Eevans) >>! In T423168#11816624, @Scott_French wrote: > @Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up? > > Basically, onc... [20:23:22] (CR) Dzahn: [C:+2] admin: add backup yubikey to myself, dzahn [puppet] - https://gerrit.wikimedia.org/r/1269649 (owner: Dzahn) [20:25:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419635)', diff saved to https://phabricator.wikimedia.org/P90585 and previous config saved to /var/cache/conftool/dbconfig/20260413-202506-fceratto.json [20:25:41] (CR) Dzahn: [C:+2] "with out-of-band verification" [puppet] - https://gerrit.wikimedia.org/r/1269649 (owner: Dzahn) [20:27:29] (CR) Eevans: [C:+2] aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) (owner: Eevans) [20:28:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[3075-3081].esams.wmnet} and A:cp - 9.2.13 Upgrade () [20:31:22] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[3066,3068-3073].esams.wmnet} and A:cp - 9.2.13 Upgrade () [20:35:15] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P90586 and previous config saved to /var/cache/conftool/dbconfig/20260413-203514-fceratto.json [20:38:06] SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816863 (Scott_French) >>! In T423168#11816804, @Eevans wrote: > [...] > These almost always occur in batches (i.e. hardware refreshes, expansions, etc), usually on the order... [20:40:04] SRE, DNS, Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T423199 (JKelsoteel-WMF) NEW [20:40:10] (PS4) Cwhite: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [20:41:48] (PS1) Kamila Součková: Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) [20:41:48] (PS3) Ryan Kemper: growthbook: Add automation API key placeholders [deployment-charts] - https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) [20:41:48] (PS1) Ryan Kemper: growthbook: Fix env var indent in job template [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) [20:41:50] (PS1) Ryan Kemper: growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) [20:42:53] (CR) Cwhite: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking) [20:45:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P90587 and previous config saved to 
/var/cache/conftool/dbconfig/20260413-204523-fceratto.json [20:46:13] (CR) Scott French: [C:+1] Revert "Enable $wgTempCategoryCollations for testwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/1270470 (https://phabricator.wikimedia.org/T422546) (owner: Kamila Součková) [20:47:04] (CR) Scott French: [C:+1] Revert "Temporarily add shellbox-icu to $wgShellboxUrls" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270472 (https://phabricator.wikimedia.org/T422546) (owner: Kamila Součková) [20:48:30] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:50:40] FIRING: [2x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:32] (CR) Scott French: Revert "shellbox: Setup shellbox-icu72" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: Kamila Součková) [20:53:42] SRE, DNS, Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T423199#11816927 (BCornwall) Open→In progress a:BCornwall [20:55:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419635)', diff saved to https://phabricator.wikimedia.org/P90588 and previous config saved to /var/cache/conftool/dbconfig/20260413-205531-fceratto.json [20:55:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [20:55:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [20:55:40] FIRING: [4x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:56:16] (PS1) BCornwall: wikimedia.org: Add TXT verification for Miro [dns] - https://gerrit.wikimedia.org/r/1270568 (https://phabricator.wikimedia.org/T423199) [21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2100). [21:06:47] (CR) Ssingh: [C:+1] wikimedia.org: Add TXT verification for Miro [dns] - https://gerrit.wikimedia.org/r/1270568 (https://phabricator.wikimedia.org/T423199) (owner: BCornwall) [21:08:54] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1025.eqiad.wmnet with reason: Bootstrapping — T412830 [21:08:58] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [21:13:10] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_eqsin - 9.2.13 Upgrade () [21:13:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_eqsin - 9.2.13 Upgrade () [21:13:27] (PS2) Bodhisattwa: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 [21:13:41] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:49] (PS1) Mstyles: Route email confirmation funnel through Test Kitchen experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 
https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) [21:15:40] FIRING: [4x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:15:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [21:16:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T410589)', diff saved to https://phabricator.wikimedia.org/P90589 and previous config saved to /var/cache/conftool/dbconfig/20260413-211606-ladsgroup.json [21:16:12] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:17:22] (CR) C. Scott Ananian: [C:+1] Deploy PRV to 4 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1270518 (https://phabricator.wikimedia.org/T423188) (owner: Arlolra) [21:20:52] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) (owner: Mstyles) [21:20:59] (CR) Andrew Bogott: [C:+2] floating_ip_updater: use project name (not id) for ptr records [puppet] - https://gerrit.wikimedia.org/r/1264738 (https://phabricator.wikimedia.org/T421739) (owner: Andrew Bogott) [21:23:13] Hey all - I have a couple of sec patches I’d like to get out today. 
[21:24:32] (CR) Jon Harald Søby: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa) [21:33:39] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:34:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:36:03] (PS3) Bodhisattwa: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 [21:38:44] (CR) Bartosz Dziewoński: [C:+1] rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler) [21:40:39] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:41:17] !log Deployed security patch for T418533 [21:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:11] SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro - https://phabricator.wikimedia.org/T423199#11817142 (Dzahn) [21:44:37] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:44:49] !log sbassett@deploy1003 Started scap sync-world: Deployed security fix for 
T422085 [21:47:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:50:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:51:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:52:13] FYI, I'm going to be applying some pending external-services network policy changes in the background [21:52:37] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:52:57] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [21:53:58] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [21:54:18] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [21:55:02] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [21:55:14] !log brett@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=1) Rolling upgrade of ATS on A:cp-text_eqsin - 9.2.13 Upgrade () [21:55:29] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [21:56:32] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[22:02:32] (CR) BCornwall: [C:+2] wikimedia.org: Add TXT verification for Miro [dns] - https://gerrit.wikimedia.org/r/1270568 (https://phabricator.wikimedia.org/T423199) (owner: BCornwall) [22:02:48] (PS1) Dzahn: add fake keys for new zuul to connect to gerrit [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) [22:02:50] !log brett@dns1006 START - running authdns-update [22:02:59] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [22:03:18] (PS2) Dzahn: add fake keys for new zuul to connect to gerrit [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) [22:03:30] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [22:04:03] (CR) Dzahn: [V:+2 C:+2] "not-labs-not-private in labs/private" [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) (owner: Dzahn) [22:04:05] !log applied pending external-services network policy diffs for aqs1025 in wikikube clusters [22:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:12] !log brett@dns1006 END - running authdns-update [22:06:06] SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro - https://phabricator.wikimedia.org/T423199#11817200 (BCornwall) Hi, @JKelsoteel-WMF ! This has been deployed - I'm going to go ahead and close this; Please do re-open if something... 
[22:06:12] SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro - https://phabricator.wikimedia.org/T423199#11817203 (BCornwall) In progress→Resolved [22:08:44] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_eqsin - 9.2.13 Upgrade () [22:08:59] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[5023-5024].eqsin.wmnet} and A:cp - 9.2.13 Upgrade () [22:11:40] (CR) Bodhisattwa: "thanks for the correction, its now restored" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa) [22:15:03] !log sbassett@deploy1003 Finished scap sync-world: Deployed security fix for T422085 (duration: 30m 14s) [22:15:46] …and done [22:23:07] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[5023-5024].eqsin.wmnet} and A:cp - 9.2.13 Upgrade () [22:24:37] (PS1) Dzahn: zuul: make gerrit ssh key configurable in Hiera and add it [puppet] - https://gerrit.wikimedia.org/r/1270580 (https://phabricator.wikimedia.org/T422895) [22:26:06] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_codfw - 9.2.13 Upgrade () [22:26:09] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_codfw - 9.2.13 Upgrade () [22:27:51] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11817242 (Ladsgroup) Databases now have a centralized depool and repool cookbook that encapsulates all the different ways you need to depool and repool db hosts (for different... 
[22:29:50] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5023.* [22:29:54] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5024.* [22:35:19] (CR) ArielGlenn: [C:+1] rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler) [22:36:14] (PS1) Brian Wolff: Record file usage from TemplateStyles pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1270583 (https://phabricator.wikimedia.org/T413707) [22:50:03] (PS1) Cwhite: logging: add ocsp secret [labs/private] - https://gerrit.wikimedia.org/r/1270586 (https://phabricator.wikimedia.org/T350516) [22:51:01] (CR) Cwhite: [V:+2 C:+2] logging: add ocsp secret [labs/private] - https://gerrit.wikimedia.org/r/1270586 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite) [22:56:13] (PS1) Cwhite: Revert "logging: add dummy pki "secrets"" [labs/private] - https://gerrit.wikimedia.org/r/1270589 [22:56:51] (CR) Cwhite: [V:+2 C:+2] Revert "logging: add dummy pki "secrets"" [labs/private] - https://gerrit.wikimedia.org/r/1270589 (owner: Cwhite) [23:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2300) [23:00:40] FIRING: [3x] ProbeDown: Service aqs1025-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:02:05] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_codfw - 9.2.13 Upgrade () [23:03:06] (PS1) Cwhite: beta-logs: change root_ocsp_key path to match labs-private [puppet] - https://gerrit.wikimedia.org/r/1270590 (https://phabricator.wikimedia.org/T350516) [23:03:12] (CR) 
10Cwhite: [C:03+2] beta-logs: change root_ocsp_key path to match labs-private [puppet] - 10https://gerrit.wikimedia.org/r/1270590 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [23:05:28] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_codfw - 9.2.13 Upgrade () [23:05:40] FIRING: [3x] ProbeDown: Service aqs1025-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:09:30] (03PS1) 10Cwhite: beta-logs: change private_cert_base to match labs-private [puppet] - 10https://gerrit.wikimedia.org/r/1270591 (https://phabricator.wikimedia.org/T350516) [23:12:13] (03CR) 10Cwhite: [C:03+2] beta-logs: change private_cert_base to match labs-private [puppet] - 10https://gerrit.wikimedia.org/r/1270591 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [23:15:47] (03PS1) 10Bvibber: Enable ReaderExperiments for itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) [23:17:37] (03CR) 10Eric Gardner: [C:03+1] "LGTM – we can talk about backporting tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: 10Bvibber) [23:20:05] PROBLEM - Ensure traffic_manager is running for instance backend on cp2057 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:21:05] RECOVERY - Ensure traffic_manager is running for instance backend on cp2057 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:25:58] (03PS1) 10Cwhite: beta-logs: add dummy pki "secrets" [puppet] - 10https://gerrit.wikimedia.org/r/1270593 (https://phabricator.wikimedia.org/T350516) [23:26:54] (03CR) 10Cwhite: [C:03+2] beta-logs: add dummy pki "secrets" [puppet] - 10https://gerrit.wikimedia.org/r/1270593 (https://phabricator.wikimedia.org/T350516) (owner: 10Cwhite) [23:39:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1270594 [23:39:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1270594 (owner: 10TrainBranchBot) [23:42:57] (03PS1) 10Andrew Bogott: floating_ip_updater: use project name (not id) for ptr records [puppet] - 10https://gerrit.wikimedia.org/r/1270595 (https://phabricator.wikimedia.org/T421739) [23:45:07] (03CR) 10Andrew Bogott: [C:03+2] floating_ip_updater: use project name (not id) for ptr records [puppet] - 10https://gerrit.wikimedia.org/r/1270595 (https://phabricator.wikimedia.org/T421739) (owner: 10Andrew Bogott) [23:49:06] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool pool db2208: Work done [23:49:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1270594 (owner: 10TrainBranchBot) [23:49:51] !log eevans@deploy1003 helmfile 
[staging] START helmfile.d/services/linked-artifacts: sync [23:49:58] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: sync [23:50:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:51:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:53:56] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: sync [23:54:03] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: sync [23:54:39] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:55:39] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:59:12] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270600