[00:07:41] <icinga-wm>	 PROBLEM - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s7 at eqiad (db1171) taken on 2026-04-12 23:09:24 is 742 GiB, but the previous one was 941 GiB, a change of -21.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:19:35] <wikibugs>	 SRE-swift-storage, MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065 (Pppery) NEW
[00:34:55] <wikibugs>	 SRE-swift-storage, MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11812480 (Pppery)
[00:53:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:09:16] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:15:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:17:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:17:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:34:16] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:41] <TimStarling>	 !log on gerrit2003 restarted gerrit T423027
[03:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:44] <stashbot>	 T423027: 2026-04-12 Gerrit Outage (was: DiskSpace) - https://phabricator.wikimedia.org/T423027
[03:43:05] <wikibugs>	 (CR) ArielGlenn: [C:+1] rest gateway: prevent abuse of exempt api modules [deployment-charts] - https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[03:54:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:54:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:55:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:55:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:06:13] <wikibugs>	 (CR) ArielGlenn: "Actually I'd like more clarity :-D  Where are the expensive api queries executed, one or the other of those domains or both?" [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: Daniel Kinzler)
[04:45:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:45:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:46:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:46:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:53:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:46] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11812621 (Marostegui)  Thank you - let me know if I can help
[05:11:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 67141960 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:12:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3182952 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:20:43] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T410589)', diff saved to https://phabricator.wikimedia.org/P90465 and previous config saved to /var/cache/conftool/dbconfig/20260413-052042-ladsgroup.json
[05:20:47] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:30:52] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P90466 and previous config saved to /var/cache/conftool/dbconfig/20260413-053050-ladsgroup.json
[05:41:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P90467 and previous config saved to /var/cache/conftool/dbconfig/20260413-054100-ladsgroup.json
[05:51:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T410589)', diff saved to https://phabricator.wikimedia.org/P90468 and previous config saved to /var/cache/conftool/dbconfig/20260413-055106-ladsgroup.json
[05:51:10] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:51:23] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[05:51:31] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T410589)', diff saved to https://phabricator.wikimedia.org/P90469 and previous config saved to /var/cache/conftool/dbconfig/20260413-055130-ladsgroup.json
[06:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:44:35] <wikibugs>	 (CR) Muehlenhoff: "Looks good. Alternatively we could also simply stick with the 3.0 package as maintained by Debian? For the container images we need to les" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[06:52:03] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove obsolete Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1269466 (owner: Muehlenhoff)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T0700)
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:05:32] <wikibugs>	 (PS1) Muehlenhoff: Add a new Cumin alias to match hosts which are accessible via kerberized SSH [puppet] - https://gerrit.wikimedia.org/r/1270279
[07:09:27] <moritzm>	 !log installing openssh security updates
[07:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:52] <wikibugs>	 (CR) Elukey: "I am totally fine with it, my main question mark is who should help maintaining this. I/F can surely help but a team like Traffic is in a " [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[07:12:05] <wikibugs>	 (PS1) Majavah: P:kubernetes: deployment_server: Remove kafka cluster IPv6 flag [puppet] - https://gerrit.wikimedia.org/r/1270281
[07:13:06] <wikibugs>	 (CR) Muehlenhoff: "I don't have a real preference either, just mentioning the option :-)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[07:15:15] <wikibugs>	 (PS1) Majavah: P:wmcs::striker: Remove separate monitoring profile [puppet] - https://gerrit.wikimedia.org/r/1270282
[07:16:06] <wikibugs>	 (PS4) Majavah: hieradata: Enable paging for dumps services [puppet] - https://gerrit.wikimedia.org/r/1268979
[07:16:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mirror1001.wikimedia.org
[07:17:08] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Enable paging for dumps services [puppet] - https://gerrit.wikimedia.org/r/1268979 (owner: Majavah)
[07:20:28] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[07:20:53] <icinga-wm>	 RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:21:32] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply
[07:21:38] <wikibugs>	 (PS2) Majavah: wikimedia.org: Send dumps to LVS service [dns] - https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040)
[07:21:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:21:44] <jinxer-wm>	 Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ...
[07:21:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:23:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mirror1001.wikimedia.org
[07:25:26] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] wikimedia.org: Send dumps to LVS service [dns] - https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[07:34:18] <wikibugs>	 (PS1) Brouberol: growthbook-next: test release [deployment-charts] - https://gerrit.wikimedia.org/r/1270284 (https://phabricator.wikimedia.org/T420781)
[07:35:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:35:35] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[07:38:18] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8406/co" [puppet] - https://gerrit.wikimedia.org/r/1270281 (owner: Majavah)
[07:40:27] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[07:55:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:01:28] <wikibugs>	 (PS1) Elukey: profile::pki::intermediates: update debmonitor's public key [puppet] - https://gerrit.wikimedia.org/r/1270286 (https://phabricator.wikimedia.org/T420993)
[08:06:15] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:09:17] <wikibugs>	 (CR) Majavah: [C:+2] wikimedia.org: Send dumps to LVS service [dns] - https://gerrit.wikimedia.org/r/1268955 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[08:09:24] <logmsgbot>	 !log taavi@dns1004 START - running authdns-update
[08:10:46] <logmsgbot>	 !log taavi@dns1004 END - running authdns-update
[08:16:34] <wikibugs>	 (PS1) Majavah: wikimedia.org: Restore original TTL for dumps [dns] - https://gerrit.wikimedia.org/r/1270363 (https://phabricator.wikimedia.org/T422040)
[08:17:06] <wikibugs>	 (CR) Elukey: [C:+2] cfssl::cert: handle the rotation of the intermediate keypair [puppet] - https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) (owner: Elukey)
[08:19:31] <wikibugs>	 (PS1) Kevin Bazira: istio-proxy: add EnvoyFilter to rewrite KServe batcher error responses for edit-check isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482)
[08:21:45] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance
[08:22:34] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P90470 and previous config saved to /var/cache/conftool/dbconfig/20260413-082233-fceratto.json
[08:22:39] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:25:49] <wikibugs>	 (CR) MVernon: "Thanks for tagging me on this, but swift no longer uses nginx (and I double-checked on debmonitor.wikimedia.org that I'd not missed any)" [puppet] - https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[08:29:32] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[08:29:46] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Obsolete airflow-wmde-admins POSIX group [puppet] - https://gerrit.wikimedia.org/r/1266959 (owner: Muehlenhoff)
[08:30:20] <wikibugs>	 (CR) Btullis: [C:+1] "Great, thanks for this." [puppet] - https://gerrit.wikimedia.org/r/1269227 (https://phabricator.wikimedia.org/T422778) (owner: Marostegui)
[08:30:26] <wikibugs>	 (PS1) Tiziano Fogli: alerts/deploy: reload config on correct instance during deploy [puppet] - https://gerrit.wikimedia.org/r/1270367 (https://phabricator.wikimedia.org/T406054)
[08:38:03] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P90471 and previous config saved to /var/cache/conftool/dbconfig/20260413-083801-fceratto.json
[08:38:06] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:39:43] <wikibugs>	 (CR) Dpogorzelski: [C:+1] "I understand the problem, we can try it on experimental." [deployment-charts] - https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[08:41:38] <wikibugs>	 (CR) Dpogorzelski: [C:+1] "https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol#inference-response-json-error-object is the more up to " [deployment-charts] - https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[08:43:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org
[08:48:51] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P90473 and previous config saved to /var/cache/conftool/dbconfig/20260413-084850-fceratto.json
[08:48:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org
[08:50:46] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270368
[08:50:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270368 (owner: TrainBranchBot)
[08:51:40] <wikibugs>	 (CR) Btullis: [C:+1] growthbook-next: test release [deployment-charts] - https://gerrit.wikimedia.org/r/1270284 (https://phabricator.wikimedia.org/T420781) (owner: Brouberol)
[08:52:05] <wikibugs>	 (CR) Brouberol: [C:+2] growthbook-next: test release [deployment-charts] - https://gerrit.wikimedia.org/r/1270284 (https://phabricator.wikimedia.org/T420781) (owner: Brouberol)
[08:53:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:40] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P90474 and previous config saved to /var/cache/conftool/dbconfig/20260413-085938-fceratto.json
[09:00:07] <wikibugs>	 (CR) Daniel Kinzler: "On both domains. But wikifunctions isn't routed through the gateway at the moment. It's even running on a separate cluster. It's possible " [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: Daniel Kinzler)
[09:01:12] <wikibugs>	 (CR) Clément Goubert: [C:+1] "Currently, queries to `abstract.wikipedia.org` are executed on the same `mw-api-ext` deployments as other wikis." [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: Daniel Kinzler)
[09:03:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270368 (owner: TrainBranchBot)
[09:05:06] <wikibugs>	 (PS4) Daniel Kinzler: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581)
[09:05:13] <wikibugs>	 (PS3) Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1269024
[09:05:19] <wikibugs>	 (PS4) Daniel Kinzler: rest gateway: prevent abuse of exempt api modules [deployment-charts] - https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130)
[09:07:35] <wikibugs>	 (PS1) Federico Ceratto: sre.mysql.pool: Handle private tasks exception [cookbooks] - https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460)
[09:07:35] <wikibugs>	 (CR) Federico Ceratto: "Do we have a task where we can test this before merging perhaps?" [cookbooks] - https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: Federico Ceratto)
[09:08:12] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] thanos/compact: adjust expressions for multi-instance compactor [alerts] - https://gerrit.wikimedia.org/r/1269673 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[09:08:13] <wikibugs>	 (CR) Kevin Bazira: [C:+2] "thanks for sharing links to KServe v2 protocol docs. unfortunately, the kserve batcher seems to only support the KServe v1 protocol: https" [deployment-charts] - https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[09:09:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: Sergio Gimeno)
[09:09:59] <wikibugs>	 (Merged) jenkins-bot: thanos/compact: adjust expressions for multi-instance compactor [alerts] - https://gerrit.wikimedia.org/r/1269673 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[09:10:28] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419635)', diff saved to https://phabricator.wikimedia.org/P90476 and previous config saved to /var/cache/conftool/dbconfig/20260413-091027-fceratto.json
[09:10:32] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:10:35] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance
[09:11:24] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90477 and previous config saved to /var/cache/conftool/dbconfig/20260413-091122-fceratto.json
[09:15:39] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates
[09:15:39] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[09:15:48] <logmsgbot>	 !log root@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[09:15:48] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates
[09:16:31] <wikibugs>	 (Merged) jenkins-bot: istio-proxy: add EnvoyFilter to rewrite KServe batcher error responses for edit-check isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1270365 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[09:17:09] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates
[09:17:09] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[09:17:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:17:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates
[09:19:06] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates
[09:19:06] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[09:19:11] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:19:11] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates
[09:19:37] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[09:25:13] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc1 on pc2011 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1011.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1011.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:26:41] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90479 and previous config saved to /var/cache/conftool/dbconfig/20260413-092640-fceratto.json
[09:26:45] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:29:00] <wikibugs>	 (PS4) Clément Goubert: haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[09:29:13] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc1 on pc2011 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:35] <wikibugs>	 (CR) Fabfur: [C:+1] haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[09:37:30] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P90480 and previous config saved to /var/cache/conftool/dbconfig/20260413-093729-fceratto.json
[09:38:25] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[09:42:58] <wikibugs>	 (CR) Blake: [C:+2] haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[09:43:13] <wikibugs>	 (CR) Blake: [V:+2 C:+2] haproxy: upgrade to Trixie and 3.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1269992 (https://phabricator.wikimedia.org/T422926) (owner: Elukey)
[09:47:45] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[09:48:19] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P90481 and previous config saved to /var/cache/conftool/dbconfig/20260413-094818-fceratto.json
[09:49:28] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[09:55:06] <wikibugs>	 (PS1) Blake: thumbor: upgrade haproxy to 3.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926)
[09:57:23] <wikibugs>	 (CR) Hnowlan: [C:+1] alerts/deploy: reload config on correct instance during deploy [puppet] - https://gerrit.wikimedia.org/r/1270367 (https://phabricator.wikimedia.org/T406054) (owner: Tiziano Fogli)
[09:57:51] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] alerts/deploy: reload config on correct instance during deploy [puppet] - https://gerrit.wikimedia.org/r/1270367 (https://phabricator.wikimedia.org/T406054) (owner: Tiziano Fogli)
[09:59:08] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90482 and previous config saved to /var/cache/conftool/dbconfig/20260413-095906-fceratto.json
[09:59:11] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:59:15] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[10:00:04] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90483 and previous config saved to /var/cache/conftool/dbconfig/20260413-100003-fceratto.json
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1000)
[10:00:21] <wikibugs>	 (CR) Hnowlan: [C:+1] "I don't have capacity to merge/deploy this atm, but at base this looks good - thanks!" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1270150 (https://phabricator.wikimedia.org/T290345) (owner: TheDJ)
[10:00:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:01:25] <wikibugs>	 (CR) Clément Goubert: [C:+1] thumbor: upgrade haproxy to 3.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) (owner: Blake)
[10:01:41] <wikibugs>	 (CR) Blake: [C:+2] thumbor: upgrade haproxy to 3.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) (owner: Blake)
[10:02:40] <wikibugs>	 (CR) Daniel Kinzler: rest gateway: prevent abuse of exempt api modules (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[10:03:54] <wikibugs>	 (Merged) jenkins-bot: thumbor: upgrade haproxy to 3.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1270374 (https://phabricator.wikimedia.org/T422926) (owner: Blake)
[10:04:12] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: Daniel Kinzler)
[10:05:43] <logmsgbot>	 !log blake@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:05:54] <logmsgbot>	 !log blake@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:06:21] <wikibugs>	 (CR) MVernon: [C:+2] apus: add two new storage nodes in codfw [puppet] - https://gerrit.wikimedia.org/r/1269963 (https://phabricator.wikimedia.org/T418902) (owner: MVernon)
[10:06:39] <logmsgbot>	 !log blake@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:06:44] <wikibugs>	 (Merged) jenkins-bot: rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: Daniel Kinzler)
[10:07:26] <logmsgbot>	 !log blake@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:09:00] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1269970 (owner: Federico Ceratto)
[10:09:20] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:09:24] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:11:49] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan)
[10:14:14] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:14:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet
[10:14:52] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:15:11] <logmsgbot>	 !log blake@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:15:32] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90484 and previous config saved to /var/cache/conftool/dbconfig/20260413-101530-fceratto.json
[10:15:35] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:15:45] <logmsgbot>	 !log blake@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:16:35] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11813595 (10MatthewVernon) 05Resolved→03Open Hi @Jclark-ctr could you take another look at the disks on these two systems, please? There should be 24...
[10:19:20] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:19:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet
[10:19:42] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:19:51] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024
[10:20:05] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler)
[10:20:18] <wikibugs>	 (03PS5) 10Daniel Kinzler: rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130)
[10:20:23] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[10:22:34] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler)
[10:22:36] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[10:23:06] <wikibugs>	 (03PS1) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475)
[10:26:20] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P90485 and previous config saved to /var/cache/conftool/dbconfig/20260413-102619-fceratto.json
[10:26:40] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[3067,3074].esams.wmnet} and A:cp - 9.2.13 upgrade (T422328)
[10:28:32] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:29:00] <icinga-wm>	 ACKNOWLEDGEMENT - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s7 at eqiad (db1171) taken on 2026-04-12 23:09:24 is 742 GiB, but the previous one was 941 GiB, a change of -21.2 % Jcrespo expected - The acknowledgement expires at: 2026-04-15 10:28:40. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:29:19] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:29:28] <wikibugs>	 (03CR) 10Elukey: "@rcoccioli@wikimedia.org: no-shame-time: I used an AI assistant to navigate the parallel failures since they were really sneaky, but the r" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[10:29:52] <wikibugs>	 (03PS1) 10MVernon: codfw: remove 3 drained ms be nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1270382 (https://phabricator.wikimedia.org/T354872)
[10:30:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1270286 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[10:33:18] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:34:00] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:37:09] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P90486 and previous config saved to /var/cache/conftool/dbconfig/20260413-103707-fceratto.json
[10:37:40] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[3067,3074].esams.wmnet} and A:cp - 9.2.13 upgrade (T422328)
[10:38:15] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:38:42] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:40:54] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01)
[10:41:06] <wikibugs>	 (03PS1) 10Michael Große: stats: add counters for experiment account creation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283)
[10:41:13] <wikibugs>	 (03PS1) 10Michael Große: Record TOR account creation failure separately [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283)
[10:41:37] <wikibugs>	 (03CR) 10D3r1ck01: "Scheduled for next week on Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) (owner: 10D3r1ck01)
[10:43:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[10:43:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[10:43:34] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11813664 (10MLechvien-WMF) @Scott_French @Blake can we update the description with the conclusion on what n...
[10:44:50] <wikibugs>	 (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[10:46:24] <wikibugs>	 (03CR) 10Volans: "Thanks for digging rabbit hole, comments inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[10:47:57] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90487 and previous config saved to /var/cache/conftool/dbconfig/20260413-104756-fceratto.json
[10:48:01] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:48:05] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance
[10:48:54] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90488 and previous config saved to /var/cache/conftool/dbconfig/20260413-104852-fceratto.json
[10:50:37] <wikibugs>	 (03CR) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[10:51:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm)
[10:55:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1220:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T423009#11813719 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced optic and cable
[10:57:50] <wikibugs>	 (03PS1) 10Jforrester: [abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388
[11:02:09] <icinga-wm>	 PROBLEM - Host cirrussearch1103 is DOWN: PING CRITICAL - Packet loss = 100%
[11:03:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1103:9290 - https://phabricator.wikimedia.org/T422832#11813771 (10Jclark-ctr) 05Open→03Resolved Both cables were present and inserted, with green lights. Reseated the PSU. cleared errors
[11:04:06] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90489 and previous config saved to /var/cache/conftool/dbconfig/20260413-110405-fceratto.json
[11:04:09] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:04:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Temporarily depool puppetserver1003/2004 [dns] - 10https://gerrit.wikimedia.org/r/1270408
[11:05:01] <icinga-wm>	 RECOVERY - Host cirrussearch1103 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[11:14:54] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P90490 and previous config saved to /var/cache/conftool/dbconfig/20260413-111452-fceratto.json
[11:21:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:21:59] <jinxer-wm>	 Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ...
[11:21:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:25:43] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P90491 and previous config saved to /var/cache/conftool/dbconfig/20260413-112541-fceratto.json
[11:36:32] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90492 and previous config saved to /var/cache/conftool/dbconfig/20260413-113630-fceratto.json
[11:36:36] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:36:39] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance
[11:36:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Temporarily depool puppetserver1003/2004 [dns] - 10https://gerrit.wikimedia.org/r/1270408 (owner: 10Muehlenhoff)
[11:36:48] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[11:38:07] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[11:38:44] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installservers: Do not format /srv on an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1269227 (https://phabricator.wikimedia.org/T422778) (owner: 10Marostegui)
[11:48:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2004.codfw.wmnet
[11:49:06] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance
[11:49:55] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P90493 and previous config saved to /var/cache/conftool/dbconfig/20260413-114953-fceratto.json
[11:49:58] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:54:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2004.codfw.wmnet
[11:55:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1003.eqiad.wmnet
[11:58:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Allow to easily disable puppet-merges temporarily - https://phabricator.wikimedia.org/T423121 (10MoritzMuehlenhoff) 03NEW
[12:01:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1003.eqiad.wmnet
[12:04:29] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P90494 and previous config saved to /var/cache/conftool/dbconfig/20260413-120428-fceratto.json
[12:04:33] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:05:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Temporarily depool puppetserver1003/2004" [dns] - 10https://gerrit.wikimedia.org/r/1270413
[12:12:07] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] codfw: remove 3 drained ms be nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1270382 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[12:15:18] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P90495 and previous config saved to /var/cache/conftool/dbconfig/20260413-121516-fceratto.json
[12:19:01] <wikibugs>	 (03PS2) 10Jforrester: [DNM] Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420)
[12:20:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Revert "Temporarily depool puppetserver1003/2004" [dns] - 10https://gerrit.wikimedia.org/r/1270413 (owner: 10Muehlenhoff)
[12:20:23] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[12:21:43] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[12:23:17] <wikibugs>	 (03CR) 10MVernon: [C:03+2] codfw: remove 3 drained ms be nodes for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1270382 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[12:26:06] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P90496 and previous config saved to /var/cache/conftool/dbconfig/20260413-122604-fceratto.json
[12:26:55] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host clouddb1019.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:29:27] <wikibugs>	 (03PS1) 10Muehlenhoff: mariadb: Migrate section-specific DBA access rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705)
[12:31:54] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff)
[12:36:29] <wikibugs>	 (03PS3) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804)
[12:36:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:36:48] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:36:54] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T419635)', diff saved to https://phabricator.wikimedia.org/P90497 and previous config saved to /var/cache/conftool/dbconfig/20260413-123653-fceratto.json
[12:36:57] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:37:10] <wikibugs>	 (03PS3) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804)
[12:37:13] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance
[12:37:24] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804)
[12:38:02] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P90498 and previous config saved to /var/cache/conftool/dbconfig/20260413-123801-fceratto.json
[12:38:41] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie
[12:38:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert)
[12:38:54] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814093 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[12:40:01] <wikibugs>	 (03PS2) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804)
[12:40:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 (owner: 10Jforrester)
[12:45:49] <wikibugs>	 (03PS4) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804)
[12:45:49] <wikibugs>	 (03PS3) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804)
[12:47:53] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1019.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:48:25] <wikibugs>	 (03CR) 10Kamila Součková: "Looking at the pcc diff, the IP addresses changed. I haven't looked into why, but I thought this was a no-functional-change-intended patch" [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah)
[12:49:14] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "As far as I can tell that is just a change in how they are ordered?" [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah)
[12:52:32] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P90499 and previous config saved to /var/cache/conftool/dbconfig/20260413-125231-fceratto.json
[12:52:36] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:52:36] <logmsgbot>	 jclark@cumin1003 reimage (PID 2350535) is awaiting input
[12:52:50] <logmsgbot>	 jclark@cumin1003 reimage (PID 2361809) is awaiting input
[12:53:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:00] <wikibugs>	 (03PS2) 10Anzx: urwikisource: add مصنف (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824)
[12:54:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:54:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:54:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx)
[12:55:22] <Lucas_WMDE>	 urwikisource 🤔
[12:55:22] <Lucas_WMDE>	 https://bash.toolforge.org/quip/AU7VU7Zh6snAnmqnK_td
[12:57:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:57:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:58:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org
[12:59:38] <wikibugs>	 (03PS1) 10Blake: admin: add Blake's backup SSH key. [puppet] - 10https://gerrit.wikimedia.org/r/1270436
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1300).
[13:00:05] <jouncebot>	 aude, Sergi0, MichaelG_WMF, James_F, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Lucas_WMDE>	 o/
[13:00:25] <anzx>	 Lucas_WMDE: o/
[13:00:28] <sergi0>	 o/
[13:00:28] <moritzm>	 !log installing libnginx-mod-http-lua security updates
[13:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:32] * MichaelG_WMF is here
[13:00:35] <Lucas_WMDE>	 I can deploy
[13:00:40] <aude>	 hi
[13:00:58] <Lucas_WMDE>	 I think aude’s p-personal backport sounds like the most important one, so let’s start the gate-and-submit for that
[13:01:02] <aude>	 my patches can go in any order
[13:01:04] <Lucas_WMDE>	 and then during those 15 minutes deploy config changes
[13:01:13] <aude>	 sounds good
[13:01:18] <Lucas_WMDE>	 (I started looking at anzx’ config change but I’m not done with it yet)
[13:01:21] <wikibugs>	 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11814156 (10Ladsgroup) Thanks. I asked around to see if anyone would be willing to take a look.
[13:01:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host phab1006.eqiad.wmnet with OS trixie
[13:01:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[13:01:32] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814167 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host phab1006.eqiad.wmnet with OS trixie
[13:01:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude)
[13:02:53] <wikibugs>	 (03Merged) 10jenkins-bot: Opt-in new accounts to ReadingLists beta feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270016 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude)
[13:03:21] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P90500 and previous config saved to /var/cache/conftool/dbconfig/20260413-130320-fceratto.json
[13:03:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270016|Opt-in new accounts to ReadingLists beta feature on pilot wikis (T422833)]]
[13:03:50] <stashbot>	 T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833
[13:04:13] <MichaelG_WMF>	 two of my changes cannot be positively tested (they only affect metrics/stats that will begin collection after these backports are done), but the third one, GrowthSuggestionToneCheck: flag as non-experimental, will be testable.
[13:04:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org
[13:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Bit confusing that this language seems to put the word for “discussion” at the *front* of the talk namespace name (which means that the wo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx)
[13:06:59] <James_F>	 Isn't RTL fun?
[13:07:05] <Lucas_WMDE>	 MichaelG_WMF: should they all be deployed together then?
[13:07:13] <Lucas_WMDE>	 James_F: !sey
[13:07:22] <MichaelG_WMF>	 Lucas_WMDE: Doesn't hurt
[13:08:00] <Lucas_WMDE>	 extra fun when I’m looking at MessagesUr.php in emacs and have no idea if Emacs, tmux, and/or GNOME Terminal are responsible for displaying the RTLness correctly, and if they’re all pals with each other about it or not
[13:08:19] <James_F>	 Or if two of them are both broken and cancel each other out?
[13:08:25] <Lucas_WMDE>	 (:
[13:09:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[13:10:01] <James_F>	 00:01:32.263   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/UniversalLanguageSelector/': GnuTLS recv error (-54): Error in the pull function.'
[13:10:03] <James_F>	 Sigh.
[13:10:24] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] "C'mon, CI, we believe in you." [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[13:11:54] <Lucas_WMDE>	 I love T421827
[13:11:54] <stashbot>	 T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827
[13:12:05] <James_F>	 Indeed.
[13:12:27] <Lucas_WMDE>	 meanwhile https://spiderpig.wikimedia.org/jobs/1735 has been building image for quite some time 🤔
[13:12:29] * Lucas_WMDE looks
[13:13:01] <Lucas_WMDE>	 seems like the docker pushes are just taking some time
[13:13:02] <James_F>	 First deploy of the week, so new base SRE image?
[13:13:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:30] <James_F>	 https://sal.toolforge.org/production?p=0&q=scap&d= at least.
[13:13:32] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1006.eqiad.wmnet with reason: host reimage
[13:13:50] <Lucas_WMDE>	 good point
[13:13:53] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11814190 (10KylieTastic) I have just had the same thing happen at https://en.wikipedia.org/wiki/File:Genie_immediately_after_rescue.jpg
[13:14:09] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P90501 and previous config saved to /var/cache/conftool/dbconfig/20260413-131408-fceratto.json
[13:14:31] <Lucas_WMDE>	 I feel like we might be pushing more images too? (but I don’t have an older log to compare)
[13:14:40] <Lucas_WMDE>	 `grep docker-pusher /var/lib/spiderpig/scap-image-build-and-push-log` shows five images being pushed
[13:14:50] <Lucas_WMDE>	 webserver, singleversion, multiversion, singleversion-debug, singleversion-cli
[13:15:30] <James_F>	 Oh, are the singleversion ones new?
[13:15:56] <Lucas_WMDE>	 nothing super new in https://gitlab.wikimedia.org/repos/releng/release/-/commits/main/make-container-image/build-images.py and https://gitlab.wikimedia.org/repos/releng/scap/-/commits/master/scap/config.py though, maybe I’m wrong
[13:16:04] <Lucas_WMDE>	 (via codesearch for “singleversion”)
[13:16:28] <James_F>	 Nope, https://gitlab.wikimedia.org/repos/releng/scap/-/commit/c3080ce4a87513720b0c5720b53e2dc7f2b3b47e was 7 months ago
[13:17:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Temporarily depool puppetserver1002/2002 [dns] - 10https://gerrit.wikimedia.org/r/1270441
[13:17:42] <Lucas_WMDE>	 it finished building (after also pushing multiversion-debug and multiversion-cli)
[13:17:57] <James_F>	 Finally!
[13:18:24] <Lucas_WMDE>	 sergi0: how risky is your change, more or less?
[13:18:50] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1006.eqiad.wmnet with reason: host reimage
[13:19:12] <sergi0>	 Lucas_WMDE: worst case, it produces another validation error instead of fixing the existing one. Low risk I'd say
[13:19:41] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bookworm
[13:19:49] <Lucas_WMDE>	 ok, then we can probably combine it with another config change or two
[13:19:55] <sergi0>	 sgtm
[13:19:55] <Lucas_WMDE>	 aude: is your config change testable btw?
[13:20:04] <aude>	 yes spot checking
[13:20:04] <Lucas_WMDE>	 (I suspect “not without registering a new account”)
[13:20:07] <Lucas_WMDE>	 ok
[13:20:10] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host an-redacteddb1001
[13:20:11] <Lucas_WMDE>	 should be there soon
[13:20:13] <aude>	 i can make a test account
[13:20:53] <wikibugs>	 (03PS1) 10Kevin Bazira: istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482)
[13:21:09] <wikibugs>	 (03Merged) 10jenkins-bot: Re-add p-personal id to the user menu [skins/Vector] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270043 (https://phabricator.wikimedia.org/T422885) (owner: 10Aude)
[13:21:16] <Lucas_WMDE>	 yay, backport made it through in the meantime
[13:21:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 aude, lucaswerkmeister-wmde: Backport for [[gerrit:1270016|Opt-in new accounts to ReadingLists beta feature on pilot wikis (T422833)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:21:29] <stashbot>	 T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833
[13:21:30] <Lucas_WMDE>	 so that’s up next, then probably all the other config changes together, then MichaelG_WMF’s backports
[13:21:33] <Lucas_WMDE>	 aude: please test :)
[13:21:53] <Lucas_WMDE>	 (I’m judging James_F’s config change to be low risk as well)
[13:22:46] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira)
[13:23:12] <logmsgbot>	 btullis@cumin1003 reimage (PID 2391205) is awaiting input
[13:23:52] <aude>	 doesn't seem to opt in new accounts to the beta feature yet, but maybe i have to wait a bit or can do a follow up config change later.
[13:24:01] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1019.eqiad.wmnet with OS trixie
[13:24:02] <aude>	 otherwise, everything is okay
[13:24:03] <Lucas_WMDE>	 hm
[13:24:12] <aude>	 like feel free to proceed
[13:24:13] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: - clo...
[13:24:19] <Lucas_WMDE>	 ok
[13:24:22] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[13:24:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 aude, lucaswerkmeister-wmde: Continuing with sync
[13:24:25] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Security updates
[13:24:25] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[13:24:27] <Lucas_WMDE>	 but it worked on testwiki?
[13:24:31] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[13:24:31] <aude>	 yes
[13:24:31] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Security updates
[13:24:34] <Lucas_WMDE>	 I’% just looking at the timestamp again… looks correct to me
[13:24:36] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie
[13:24:49] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[13:24:58] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T419635)', diff saved to https://phabricator.wikimedia.org/P90502 and previous config saved to /var/cache/conftool/dbconfig/20260413-132457-fceratto.json
[13:25:01] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:25:03] <Lucas_WMDE>	 (*I’m)
[13:25:17] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2221.codfw.wmnet with reason: Maintenance
[13:25:42] <Lucas_WMDE>	 maybe some part of the account signup flow didn’t have X-Wikimedia-Debug applied 🤔
[13:25:43] <Lucas_WMDE>	 no idea
[13:25:49] <Lucas_WMDE>	 anyway, you can debug that later at your leisure ^^
[13:26:05] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90503 and previous config saved to /var/cache/conftool/dbconfig/20260413-132604-fceratto.json
[13:26:13] <aude>	 ah maybe that's it
[13:27:19] <aude>	 or it mw-debug.eqiad.pinkunicorn-6956bb54cc-lskrx and it's working
[13:28:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1270436 (owner: 10Blake)
[13:29:33] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "My bad, I needed lunch '^^ LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah)
[13:29:53] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira)
[13:30:03] <logmsgbot>	 btullis@cumin1003 reimage (PID 2391205) is awaiting input
[13:31:28] <wikibugs>	 (03CR) 10Bking: [C:03+2] nginx tls proxy: remove defunct directive [puppet] - 10https://gerrit.wikimedia.org/r/1270084 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking)
[13:35:14] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host an-redacteddb1001 - btullis@cumin1003"
[13:35:19] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host an-redacteddb1001 - btullis@cumin1003"
[13:35:19] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:35:19] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-redacteddb1001.eqiad.wmnet 18.48.64.10.in-addr.arpa 8.1.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:35:22] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-redacteddb1001.eqiad.wmnet 18.48.64.10.in-addr.arpa 8.1.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:35:23] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-redacteddb1001
[13:35:38] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814289 (10Jclark-ctr) Found a fried circuit on the board. Replaced the board and moved the CPUs over since the new ones did not match. The fault still continued on the ne...
[13:35:59] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1011 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Search
[13:36:04] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-redacteddb1001
[13:36:04] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host an-redacteddb1001
[13:36:45] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1012 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2026-07-05 07:49:09 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Search
[13:37:02] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[13:37:38] <wikibugs>	 (03Merged) 10jenkins-bot: istio-proxy: fix Lua script in EnvoyFilter to correctly rewrite KServe batcher error responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270442 (https://phabricator.wikimedia.org/T422482) (owner: 10Kevin Bazira)
[13:37:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270016|Opt-in new accounts to ReadingLists beta feature on pilot wikis (T422833)]] (duration: 34m 09s)
[13:38:00] <stashbot>	 T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833
[13:38:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270043|Re-add p-personal id to the user menu (T422885)]]
[13:38:42] <stashbot>	 T422885: #p-personal disappeared - https://phabricator.wikimedia.org/T422885
[13:40:08] <logmsgbot>	 jclark@cumin1003 reimage (PID 2361809) is awaiting input
[13:40:42] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90504 and previous config saved to /var/cache/conftool/dbconfig/20260413-134041-fceratto.json
[13:40:45] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:41:33] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[13:41:34] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab1006.eqiad.wmnet with OS trixie
[13:41:39] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814316 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host phab1006.eqiad.wmnet with OS trixie completed: - phab1006 (**PASS**)...
[13:41:58] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814319 (10Jclark-ctr)
[13:42:07] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11814320 (10Jclark-ctr) 05Open→03Resolved
[13:42:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude: Backport for [[gerrit:1270043|Re-add p-personal id to the user menu (T422885)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:42:25] <aude>	 checking
[13:42:32] <Lucas_WMDE>	 thanks
[13:42:55] <aude>	 looks good
[13:42:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aude: Continuing with sync
[13:42:59] <Lucas_WMDE>	 yay
[13:43:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2070.codfw.wmnet with OS bullseye
[13:43:43] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11814327 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye
[13:44:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2070
[13:44:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[13:45:00] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270447
[13:49:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270043|Re-add p-personal id to the user menu (T422885)]] (duration: 10m 41s)
[13:49:24] <stashbot>	 T422885: #p-personal disappeared - https://phabricator.wikimedia.org/T422885
[13:49:33] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2070 - mvernon@cumin2002"
[13:49:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2070 - mvernon@cumin2002"
[13:49:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:49:39] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2070.codfw.wmnet 86.0.192.10.in-addr.arpa 6.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:49:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno)
[13:49:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 (owner: 10Jforrester)
[13:49:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx)
[13:49:44] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2070.codfw.wmnet 86.0.192.10.in-addr.arpa 6.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:49:45] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2070
[13:50:09] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2070
[13:50:09] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2070
[13:51:10] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: remove unused contextual attributes causing problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268965 (https://phabricator.wikimedia.org/T422001) (owner: 10Sergio Gimeno)
[13:51:31] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P90505 and previous config saved to /var/cache/conftool/dbconfig/20260413-135129-fceratto.json
[13:51:31] <wikibugs>	 (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270448
[13:51:36] <wikibugs>	 (03Merged) 10jenkins-bot: [abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270388 (owner: 10Jforrester)
[13:51:45] <James_F>	 Yay.
[13:51:51] <wikibugs>	 (03Merged) 10jenkins-bot: urwikisource: add مصنف (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269788 (https://phabricator.wikimedia.org/T422824) (owner: 10Anzx)
[13:52:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1268965|EventStreamConfig: remove unused contextual attributes causing problems (T422001)]], [[gerrit:1270388|[abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s]], [[gerrit:1269788|urwikisource: add مصنف (author) namespace (T422824)]]
[13:52:12] <stashbot>	 T422001: '.performer.active_browsing_session_token' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T422001
[13:52:13] <stashbot>	 T422824: Add author namespace to urwikisource - https://phabricator.wikimedia.org/T422824
[13:52:20] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage
[13:52:53] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[13:52:58] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[13:53:02] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm)
[13:53:07] <moritzm>	 !log installing postgresql-common bugfix updates
[13:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:13] <MichaelG_WMF>	 🤞
[13:53:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, jforrester: Backport for [[gerrit:1268965|EventStreamConfig: remove unused contextual attributes causing problems (T422001)]], [[gerrit:1270388|[abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s]], [[gerrit:1269788|urwikisource: add مصنف (author) namespace (T422824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:53:58] <anzx>	 looking 
[13:54:02] * sergi0 checking
[13:54:12] <Lucas_WMDE>	 James_F: please also test ^^
[13:54:16] <James_F>	 Testing.
[13:54:41] <anzx>	 Lucas_WMDE: looks good to sync
[13:54:51] <Lucas_WMDE>	 ack
[13:54:52] <James_F>	 Looks good from my end.
[13:55:36] <MichaelG_WMF>	 > stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/GeoData/': GnuTLS recv error (-54): Error in the pull function.'
[13:55:41] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:55:45] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:55:51] <Lucas_WMDE>	 love to see it
[13:55:53] <MichaelG_WMF>	 (one of the changes that just got a +2 failed on a git error)
[13:56:34] <sergi0>	 Lucas_WMDE: lgtm
[13:56:36] <wikibugs>	 (03Merged) 10jenkins-bot: Record TOR account creation failure separately [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270384 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[13:56:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] stats: add counters for experiment account creation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[13:56:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 sgimeno, anzx, lucaswerkmeister-wmde, jforrester: Continuing with sync
[13:56:44] <Lucas_WMDE>	 thanks!
[13:57:07] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "try again (T421827)" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[13:58:53] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede)
[13:59:22] <anzx>	 Lucas_WMDE: please run namespacedupes for urwikisource once sync is finished 
[13:59:35] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage
[14:00:31] <Lucas_WMDE>	 right, thanks for the reminder
[14:00:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268965|EventStreamConfig: remove unused contextual attributes causing problems (T422001)]], [[gerrit:1270388|[abstractwiki] Enable wgParserEnableUserLanguage, so we don't need {{int:lang}}s]], [[gerrit:1269788|urwikisource: add مصنف (author) namespace (T422824)]] (duration: 08m 30s)
[14:00:41] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "PCC seems to be noop in practice: https://puppet-compiler.wmflabs.org/output/1270432/6373/db1151.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff)
[14:00:43] <stashbot>	 T422001: '.performer.active_browsing_session_token' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T422001
[14:00:43] <stashbot>	 T422824: Add author namespace to urwikisource - https://phabricator.wikimedia.org/T422824
[14:00:47] <Lucas_WMDE>	 (backports need a few more minutes in CI anyway)
[14:01:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] P:tofurkey Add tofurkey (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede)
[14:01:40] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management: Stuck-hidden file - https://phabricator.wikimedia.org/T423065#11814472 (10KylieTastic) I have also just noticed that files where I deleted old revisions, such as https://en.wikipedia.org/wiki/File:SuccessKid.jpg, the old versions do not  show "No thumbnail" but...
[14:01:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes urwikisource --fix  # T422824
[14:02:19] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P90506 and previous config saved to /var/cache/conftool/dbconfig/20260413-140218-fceratto.json
[14:02:23] <Lucas_WMDE>	 anzx: done
[14:02:33] <anzx>	 Lucas_WMDE: thanks for deploying 
[14:02:46] <Lucas_WMDE>	 jouncebot: nowandnext
[14:02:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 27 minute(s)
[14:02:46] <jouncebot>	 In 0 hour(s) and 27 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1430)
[14:02:51] <Lucas_WMDE>	 we’re technically past the end of the window but let’s still do the backports
[14:03:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[14:03:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm)
[14:04:23] <wikibugs>	 (03PS1) 10Brouberol: growthbook: release unofficial build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270450 (https://phabricator.wikimedia.org/T420781)
[14:04:57] <MichaelG_WMF>	 thanks!
[14:07:00] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm)
[14:07:43] <wikibugs>	 (03CR) 10Marostegui: "You can create one and then protect it as security issue and then it can be tested with that one." [cookbooks] - 10https://gerrit.wikimedia.org/r/1270060 (https://phabricator.wikimedia.org/T422460) (owner: 10Federico Ceratto)
[14:08:40] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:09:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage
[14:09:57] <wikibugs>	 (03CR) 10JMeybohm: rest-gateway: Add liftwing listeners and network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert)
[14:10:27] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:11:21] <wikibugs>	 (03Merged) 10jenkins-bot: stats: add counters for experiment account creation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270383 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große)
[14:11:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1270384|Record TOR account creation failure separately (T422283)]], [[gerrit:1270383|stats: add counters for experiment account creation (T422283)]], [[gerrit:1269495|GrowthSuggestionToneCheck: flag as non-experimental (T422835)]]
[14:11:44] <stashbot>	 T422283: [V1 experiment changes] Enable reliable measurement of account creation for mobile registration experiment on auth.wikimedia.org domain and support broader rollout - https://phabricator.wikimedia.org/T422283
[14:11:44] <stashbot>	 T422835: Revise Tone tasks are warning users with "Experimental edit check. For testing purposes only." warning - https://phabricator.wikimedia.org/T422835
[14:12:26] <Lucas_WMDE>	 at last
[14:12:37] <Lucas_WMDE>	 [ian mcdiarmid voice]
[14:12:39] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1011: T419961
[14:12:39] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[14:13:07] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T419635)', diff saved to https://phabricator.wikimedia.org/P90507 and previous config saved to /var/cache/conftool/dbconfig/20260413-141306-fceratto.json
[14:13:11] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:13:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 migr, lucaswerkmeister-wmde, urbanecm: Backport for [[gerrit:1270384|Record TOR account creation failure separately (T422283)]], [[gerrit:1270383|stats: add counters for experiment account creation (T422283)]], [[gerrit:1269495|GrowthSuggestionToneCheck: flag as non-experimental (T422835)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:13:20] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[14:13:20] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1011: T419961
[14:13:27] <logmsgbot>	 !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Maintenance
[14:13:48] <Lucas_WMDE>	 MichaelG_WMF: please test!
[14:13:49] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2011: T419961
[14:13:49] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[14:14:06] <MichaelG_WMF>	 will test
[14:14:11] <inflatador>	 !log bking@apt1002 sudo -E reprepro --ignore=wrongdistribution -C component/opensearch2 include trixie-wikimedia  ~/opensearch-madvise-0.2/opensearch-madvise_0.2_amd64.changes T422860
[14:14:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:14] <stashbot>	 T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860
[14:14:15] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P90509 and previous config saved to /var/cache/conftool/dbconfig/20260413-141414-fceratto.json
[14:14:19] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[14:14:19] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2011: T419961
[14:14:44] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1012: Security updates
[14:14:44] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[14:14:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2070.codfw.wmnet with reason: host reimage
[14:14:51] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[14:14:51] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1012: Security updates
[14:15:02] <MichaelG_WMF>	 Lucas_WMDE: The experimental warning is no longer there! Good to move forward 👍
[14:15:12] <wikibugs>	 (03PS1) 10Marostegui: db2224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270451 (https://phabricator.wikimedia.org/T422777)
[14:15:19] <logmsgbot>	 jclark@cumin1003 reimage (PID 2449143) is awaiting input
[14:15:29] <wikibugs>	 (03CR) 10Bearloga: [C:03+1] growthbook: release unofficial build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270450 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol)
[14:15:52] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2224.codfw.wmnet with reason: Reimage to Trixie
[14:16:01] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] growthbook: release unofficial build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270450 (https://phabricator.wikimedia.org/T420781) (owner: 10Brouberol)
[14:16:13] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2224: Reimage
[14:16:21] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1270451 (https://phabricator.wikimedia.org/T422777) (owner: 10Marostegui)
[14:16:32] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2224: Reimage
[14:17:08] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:18:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 migr, lucaswerkmeister-wmde, urbanecm: Continuing with sync
[14:18:20] <Lucas_WMDE>	 MichaelG_WMF: thanks! sorry, got distracted
[14:18:59] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[14:19:01] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:19:21] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2224.codfw.wmnet with OS trixie
[14:19:45] <wikibugs>	 (PS1) Majavah: P:rsyslog: Update Keystone unit file names [puppet] - https://gerrit.wikimedia.org/r/1270452 (https://phabricator.wikimedia.org/T421911)
[14:19:55] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[14:19:58] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[14:20:13] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc2 on pc2012 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1012.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1012.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:20:27] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-redacteddb1001.eqiad.wmnet with OS bookworm
[14:22:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270384|Record TOR account creation failure separately (T422283)]], [[gerrit:1270383|stats: add counters for experiment account creation (T422283)]], [[gerrit:1269495|GrowthSuggestionToneCheck: flag as non-experimental (T422835)]] (duration: 10m 22s)
[14:22:05] <stashbot>	 T422283: [V1 experiment changes] Enable reliable measurement of account creation for mobile registration experiment on auth.wikimedia.org domain and support broader rollout - https://phabricator.wikimedia.org/T422283
[14:22:06] <stashbot>	 T422835: Revise Tone tasks are warning users with "Experimental edit check. For testing purposes only." warning - https://phabricator.wikimedia.org/T422835
[14:22:22] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:16] <Amir1>	 jouncebot: next
[14:23:16] <jouncebot>	 In 0 hour(s) and 6 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1430)
[14:23:17] <wikibugs>	 (PS1) Bearloga: EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001)
[14:23:55] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Allow to easily disable puppet-merges temporarily - https://phabricator.wikimedia.org/T423121#11814631 (CDanis) Sounds a lot like {T248872} ?
[14:25:42] <wikibugs>	 (CR) Clément Goubert: rest-gateway: Add liftwing listeners and network policies (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[14:26:14] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc2 on pc2012 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:26:22] <wikibugs>	 (CR) Clément Goubert: rest-gateway: Add liftwing listeners and network policies (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[14:27:46] <wikibugs>	 SRE, Infrastructure-Foundations: Update debdeploy to use checkrestart instead of lsof to detect library restarts - https://phabricator.wikimedia.org/T422614#11814638 (MoritzMuehlenhoff) p:Triage→Medium
[14:28:14] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc2 on pc2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:28:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11814639 (AnneT) @MoritzMuehlenhoff apologies; I was out last week. I've confirmed that I can now access my experiment data in superset - thanks very much!
[14:28:52] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P90512 and previous config saved to /var/cache/conftool/dbconfig/20260413-142851-fceratto.json
[14:28:56] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:29:35] <wikibugs>	 (CR) Vgutierrez: [C:-1] P:tofurkey Add tofurkey (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1430)
[14:32:39] <wikibugs>	 (CR) Vgutierrez: [C:-1] P:tofurkey Add tofurkey (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[14:34:32] <wikibugs>	 (CR) Vgutierrez: [C:-1] P:tofurkey Add tofurkey (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[14:34:37] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2070.codfw.wmnet with OS bullseye
[14:34:46] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11814687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2070.codfw.wmnet with OS bullseye compl...
[14:35:26] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186#11814699 (LSobanski) Open→Declined
[14:36:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2224.codfw.wmnet with reason: host reimage
[14:37:30] <wikibugs>	 (PS2) Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507)
[14:38:13] <wikibugs>	 (CR) Dpogorzelski: amg-gpu: Set up explicit GPU partitioning (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski)
[14:39:41] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P90513 and previous config saved to /var/cache/conftool/dbconfig/20260413-143939-fceratto.json
[14:39:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS bullseye
[14:40:05] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11814730 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye
[14:40:09] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: host reimage
[14:40:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2069
[14:41:00] <wikibugs>	 (PS1) FNegri: mariadb: wiki-replicas: remove redundant grants [puppet] - https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806)
[14:41:02] <wikibugs>	 (PS1) FNegri: mariadb: wiki-replicas: add grants for %_maintain [puppet] - https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806)
[14:43:09] <wikibugs>	 (PS1) Marostegui: Revert "db2224: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270467
[14:43:29] <logmsgbot>	 mvernon@cumin2002 reimage (PID 2300741) is awaiting input
[14:43:33] <wikibugs>	 SRE, Infrastructure-Foundations: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610#11814753 (LSobanski) Open→Declined The effort of moving accounts to new UIDs is too high.
[14:44:43] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Rake tasks: add colours and buffer output - https://phabricator.wikimedia.org/T237508#11814778 (jhathaway) Open→Declined I don't think buffering the output is always wanted, as you may want to see the first error as quickly as possible, so dec...
[14:44:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[14:46:19] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609#11814782 (LSobanski) Open→Resolved a:LSobanski Should have been addressed with upgrade to Puppet 7, please reopen if you st...
[14:47:06] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: puppet-merge shouldn't fail if `tput` doesn't grok your terminal - https://phabricator.wikimedia.org/T221985#11814801 (LSobanski) Open→Declined
[14:48:27] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577#11814809 (LSobanski) Open→Declined Shouldn't be a problem with today's infrastructure.
[14:48:50] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2069 - mvernon@cumin2002"
[14:48:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2069 - mvernon@cumin2002"
[14:48:56] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:48:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2069.codfw.wmnet 181.48.192.10.in-addr.arpa 1.8.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:49:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2069.codfw.wmnet 181.48.192.10.in-addr.arpa 1.8.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:49:01] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2069
[14:49:03] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230#11814825 (LSobanski) p:Medium→Low
[14:50:14] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core: more verbose hiera messages on failures - https://phabricator.wikimedia.org/T109692#11814828 (LSobanski) Open→Declined Closing, please reopen if still a problem on the current Puppet version.
[14:50:29] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P90514 and previous config saved to /var/cache/conftool/dbconfig/20260413-145028-fceratto.json
[14:50:52] <wikibugs>	 SRE, Infrastructure-Foundations: reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812#11814830 (MoritzMuehlenhoff) p:Medium→Low
[14:50:57] <wikibugs>	 (PS1) Kamila Součková: Revert "Enable $wgTempCategoryCollations for testwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/1270470 (https://phabricator.wikimedia.org/T422546)
[14:51:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2069
[14:51:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2069
[14:52:14] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc2 on pc2012 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:52:19] <wikibugs>	 (PS1) Kamila Součková: Revert "Temporarily add shellbox-icu to $wgShellboxUrls" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270472 (https://phabricator.wikimedia.org/T422546)
[14:53:12] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2012: T419961
[14:53:12] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[14:53:25] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[14:53:25] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2012: T419961
[14:53:33] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc1012: T419961
[14:53:34] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[14:53:45] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[14:53:45] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1012: T419961
[14:54:15] <wikibugs>	 (Abandoned) Kamila Součková: Temporarily add shellbox-icu ClusterIP endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[14:54:36] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS trixie
[14:54:51] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[14:54:54] <wikibugs>	 (CR) Ladsgroup: "Do we have "client" wikis?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[14:55:50] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:00:10] <Amir1>	 jouncebot: nowandnext
[15:00:10] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[15:00:10] <jouncebot>	 In 0 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1530)
[15:00:20] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T410589)', diff saved to https://phabricator.wikimedia.org/P90516 and previous config saved to /var/cache/conftool/dbconfig/20260413-150019-ladsgroup.json
[15:00:24] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[15:00:37] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1166: repool after maintenance
[15:00:45] <wikibugs>	 (PS1) Kevin Bazira: istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482)
[15:00:50] <jinxer-wm>	 RESOLVED: [22x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:01:18] <logmsgbot>	 !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T419635)', diff saved to https://phabricator.wikimedia.org/P90518 and previous config saved to /var/cache/conftool/dbconfig/20260413-150116-fceratto.json
[15:01:27] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:02:00] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2224: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270467 (owner: Marostegui)
[15:03:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2224.codfw.wmnet with OS trixie
[15:03:16] <wikibugs>	 (CR) Dpogorzelski: [C:+1] istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[15:04:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:04:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2224: After Reimage
[15:04:53] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2224: After Reimage
[15:05:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2224: After Reimage
[15:05:12] <wikibugs>	 (PS2) FNegri: mariadb: wiki-replicas: remove redundant grants [puppet] - https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806)
[15:05:12] <wikibugs>	 (PS2) FNegri: mariadb: wiki-replicas: add grants for %_maintain [puppet] - https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806)
[15:06:34] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:07:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:07:58] <wikibugs>	 (PS1) Marostegui: db1187.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270478 (https://phabricator.wikimedia.org/T422777)
[15:08:22] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1187: Upgrade package
[15:08:43] <wikibugs>	 (CR) Marostegui: [C:+2] db1187.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1270478 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui)
[15:08:44] <wikibugs>	 (CR) Kevin Bazira: [C:+2] istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[15:08:50] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1187.eqiad.wmnet with reason: Reimage to Trixie
[15:08:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1187: Upgrade package
[15:09:11] <wikibugs>	 (PS10) Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475)
[15:09:33] <wikibugs>	 (CR) Elukey: tox: rework venvs to speed up local and CI timings (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[15:09:54] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1013: Security updates
[15:09:55] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[15:10:02] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[15:10:02] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1013: Security updates
[15:10:27] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P90522 and previous config saved to /var/cache/conftool/dbconfig/20260413-151027-ladsgroup.json
[15:10:43] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS trixie
[15:11:40] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:12:52] <wikibugs>	 (PS1) Hnowlan: prometheus, thanos: move recording rule [puppet] - https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663)
[15:13:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:14:14] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:15:03] <wikibugs>	 SRE, Lift-Wing, Machine-Learning-Team: Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11815008 (DPogorzelski-WMF)
[15:16:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc3 on pc2013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1013.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1013.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:16:49] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:16:56] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:17:00] <wikibugs>	 (CR) Brouberol: [C:+2] Allow WMDE Airflow instance to egress to dumps [deployment-charts] - https://gerrit.wikimedia.org/r/1269978 (https://phabricator.wikimedia.org/T414583) (owner: Andrew McAllister (WMDE))
[15:17:11] <wikibugs>	 (Merged) jenkins-bot: istio-proxy: move kserve-batcher-json-error-rewrite EnvoyFilter to istio-system ns to cover production [deployment-charts] - https://gerrit.wikimedia.org/r/1270476 (https://phabricator.wikimedia.org/T422482) (owner: Kevin Bazira)
[15:20:14] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc3 on pc2013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:20:15] <wikibugs>	 (PS1) Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850)
[15:20:35] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P90526 and previous config saved to /var/cache/conftool/dbconfig/20260413-152034-ladsgroup.json
[15:20:48] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[15:20:51] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[15:21:01] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:21:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:21:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:21:17] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[15:21:19] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[15:21:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:21:49] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:21:52] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:21:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:21:59] <jinxer-wm>	 Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ...
[15:21:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:22:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:25:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:25:41] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage
[15:26:12] <wikibugs>	 (PS1) Muehlenhoff: Record LDAP access for passimacopoulos [puppet] - https://gerrit.wikimedia.org/r/1270483
[15:26:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:27:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:27:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record LDAP access for passimacopoulos [puppet] - https://gerrit.wikimedia.org/r/1270483 (owner: Muehlenhoff)
[15:27:37] <wikibugs>	 (CR) Phuedx: [C:+1] EventStreamConfig: remove ABST contextual attribute [mediawiki-config] - https://gerrit.wikimedia.org/r/1270454 (https://phabricator.wikimedia.org/T422001) (owner: Bearloga)
[15:28:09] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815072 (Marostegui) Open→Resolved Thanks John for trying swapping many parts - unfortunately it didn't work so I am going to close this task and open a new...
[15:29:06] <logmsgbot>	 mvernon@cumin2002 reimage (PID 2300741) is awaiting input
[15:29:10] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815088 (Marostegui)
[15:29:20] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Enable and configure WikiProjects prototype on WikiData beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven)
[15:29:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:29:28] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage
[15:30:05] <jouncebot>	 jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1530). nyaa~
[15:30:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:30:43] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T410589)', diff saved to https://phabricator.wikimedia.org/P90527 and previous config saved to /var/cache/conftool/dbconfig/20260413-153042-ladsgroup.json
[15:30:46] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[15:30:59] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[15:31:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T410589)', diff saved to https://phabricator.wikimedia.org/P90529 and previous config saved to /var/cache/conftool/dbconfig/20260413-153107-ladsgroup.json
[15:31:24] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:31:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:31:44] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:32:01] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:32:06] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:33:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:35:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:36:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:36:52] <moritzm>	 !log installing postgresql-15 security updates
[15:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:11] <wikibugs>	 (CR) Andrew Bogott: [C:+2] "I suspect this is necessary but insufficient" [puppet] - https://gerrit.wikimedia.org/r/1270452 (https://phabricator.wikimedia.org/T421911) (owner: Majavah)
[15:37:23] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1013: Security updates
[15:37:23] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[15:37:34] <wikibugs>	 (PS3) FNegri: mariadb: wiki-replicas: remove redundant grants [puppet] - https://gerrit.wikimedia.org/r/1270464 (https://phabricator.wikimedia.org/T422806)
[15:37:34] <wikibugs>	 (PS3) FNegri: mariadb: wiki-replicas: add grants for %_maintain [puppet] - https://gerrit.wikimedia.org/r/1270465 (https://phabricator.wikimedia.org/T422806)
[15:37:35] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[15:37:35] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1013: Security updates
[15:39:18] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage
[15:40:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:41:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:41:27] <wikibugs>	 (PS2) Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850)
[15:42:31] <wikibugs>	 (CR) Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven)
[15:43:35] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:43:40] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[15:43:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage
[15:46:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1166: repool after maintenance
[15:46:37] <wikibugs>	 (PS1) Marostegui: Revert "db1187.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270485
[15:48:15] <wikibugs>	 (PS1) CDanis: check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230)
[15:48:41] <wikibugs>	 (CR) CI reject: [V:-1] check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) (owner: CDanis)
[15:49:30] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance
[15:49:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T419635)', diff saved to https://phabricator.wikimedia.org/P90533 and previous config saved to /var/cache/conftool/dbconfig/20260413-154937-fceratto.json
[15:49:41] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:50:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2224: After Reimage
[15:50:40] <wikibugs>	 (PS2) CDanis: check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230)
[15:51:12] <wikibugs>	 (CR) CI reject: [V:-1] check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230) (owner: CDanis)
[15:51:25] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1187.eqiad.wmnet with OS trixie
[15:52:02] <wikibugs>	 (PS3) CDanis: check_wmf_styleguide: handle array notation in class declarations [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/1270486 (https://phabricator.wikimedia.org/T179230)
[15:52:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419635)', diff saved to https://phabricator.wikimedia.org/P90535 and previous config saved to /var/cache/conftool/dbconfig/20260413-155253-fceratto.json
[15:53:05] <wikibugs>	 (CR) Elukey: amg-gpu: Set up explicit GPU partitioning (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski)
[15:55:53] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1187.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1270485 (owner: Marostegui)
[15:56:25] <wikibugs>	 (PS2) Elukey: istio: revisit Prometheus buckets for Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886)
[15:56:47] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "LGTM; should be deployed after I98225a2309 has been merged (but doesn’t need a Depends-On in the commit message, otherwise scap would refu" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven)
[15:56:50] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: Audrey Penven)
[15:56:51] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] smart: update smart_data_dump to support standalone disks too [puppet] - https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) (owner: Cwhite)
[15:58:41] <wikibugs>	 (CR) Elukey: "It turns out that Wikikube emits ~700k time series every 5m, while the ML clusters one order of magnitude less." [deployment-charts] - https://gerrit.wikimedia.org/r/1269998 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[15:59:15] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1187: After Reimage
[16:02:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS bullseye
[16:02:08] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11815292 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye compl...
[16:02:46] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11815293 (MoritzMuehlenhoff)
[16:03:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P90537 and previous config saved to /var/cache/conftool/dbconfig/20260413-160301-fceratto.json
[16:07:50] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1014: Security updates
[16:07:50] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[16:07:57] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[16:07:57] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1014: Security updates
[16:09:16] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:13:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P90539 and previous config saved to /var/cache/conftool/dbconfig/20260413-161310-fceratto.json
[16:13:20] <Amir1>	 jouncebot: nowandnext
[16:13:20] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[16:13:20] <jouncebot>	 In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700)
[16:13:20] <jouncebot>	 In 0 hour(s) and 46 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700)
[16:14:19] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc4 on pc2014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1014.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on pc1014.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:48] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1019.eqiad.wmnet with OS trixie
[16:15:01] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: -...
[16:17:11] <wikibugs>	 (PS1) Daimona Eaytoy: Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150)
[16:17:31] <wikibugs>	 (PS2) Daimona Eaytoy: Stop setting $wgCampaignEventsEnableEventGoals [mediawiki-config] - https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150)
[16:18:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270490 (https://phabricator.wikimedia.org/T414150) (owner: Daimona Eaytoy)
[16:19:20] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc4 on pc2014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:22:18] <wikibugs>	 (CR) Federico Ceratto: [C:+1] sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto)
[16:22:26] <wikibugs>	 (CR) Federico Ceratto: [C:+2] sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto)
[16:23:08] <wikibugs>	 (CR) Federico Ceratto: [V:+2 C:+2] sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto)
[16:23:19] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419635)', diff saved to https://phabricator.wikimedia.org/P90541 and previous config saved to /var/cache/conftool/dbconfig/20260413-162318-fceratto.json
[16:23:22] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:23:37] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance
[16:23:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T419635)', diff saved to https://phabricator.wikimedia.org/P90542 and previous config saved to /var/cache/conftool/dbconfig/20260413-162344-fceratto.json
[16:26:58] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419635)', diff saved to https://phabricator.wikimedia.org/P90543 and previous config saved to /var/cache/conftool/dbconfig/20260413-162657-fceratto.json
[16:28:17] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1273.eqiad.wmnet
[16:28:54] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1273.eqiad.wmnet
[16:30:16] <wikibugs>	 ops-codfw, DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159 (Jhancock.wm) NEW
[16:34:16] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:35:20] <Amir1>	 !log banning non-standard thumbs with external referrer regardless of cache status (T414805)
[16:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:23] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[16:35:53] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1014: Security updates
[16:35:53] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[16:36:07] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[16:36:07] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1014: Security updates
[16:37:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P90546 and previous config saved to /var/cache/conftool/dbconfig/20260413-163706-fceratto.json
[16:37:56] <wikibugs>	 (PS9) Eevans: aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830)
[16:37:56] <wikibugs>	 (PS9) Eevans: aqs1026: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830)
[16:37:56] <wikibugs>	 (PS9) Eevans: aqs1027: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830)
[16:40:01] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[16:40:46] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[16:41:15] <wikibugs>	 (PS2) Ladsgroup: envoy: Close connections to swift after 10s of inactivity [puppet] - https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872)
[16:41:21] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] envoy: Close connections to swift after 10s of inactivity [puppet] - https://gerrit.wikimedia.org/r/1270031 (https://phabricator.wikimedia.org/T328872) (owner: Ladsgroup)
[16:44:05] <mutante>	 !log contint2002 (prod CI) - re-enabled puppet - this applied a refresh of the contint.wikimedia.org certificate
[16:44:05] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[7002-7008].magru.wmnet} and A:cp - 9.2.13 Upgrade ()
[16:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1187: After Reimage
[16:44:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[7010-7016].magru.wmnet} and A:cp - 9.2.13 Upgrade ()
[16:46:09] <mutante>	 !log contint2002 (prod CI) - re-enabled puppet - this applied a refresh of the contint.wikimedia.org certificate (T423152 T420993)
[16:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:14] <stashbot>	 T423152: PuppetDisabled - contint2002 - https://phabricator.wikimedia.org/T423152
[16:46:15] <stashbot>	 T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993
[16:46:27] <wikibugs>	 ops-codfw, collaboration-services, DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11815664 (Ladsgroup) Hi, that's for sre-collab team since they own mailman now!
[16:47:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P90548 and previous config saved to /var/cache/conftool/dbconfig/20260413-164713-fceratto.json
[16:47:25] <wikibugs>	 (PS1) CDanis: aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486)
[16:50:38] <wikibugs>	 (CR) BCornwall: [C:+1] Temporarily depool puppetserver1002/2002 [dns] - https://gerrit.wikimedia.org/r/1270441 (owner: Muehlenhoff)
[16:50:51] <wikibugs>	 (CR) BCornwall: [C:+1] wikimedia.org: Restore original TTL for dumps [dns] - https://gerrit.wikimedia.org/r/1270363 (https://phabricator.wikimedia.org/T422040) (owner: Majavah)
[16:51:44] <wikibugs>	 SRE: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168 (Scott_French) NEW
[16:52:47] <wikibugs>	 SRE: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815713 (Scott_French) I've verified that manually deleting an editor-analytics pod in staging will trigger crash looping, and then setting initialDelaySeconds on the liveness probe (in t...
[16:53:28] <wikibugs>	 (PS1) Scott French: aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168)
[16:53:53] <wikibugs>	 SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815719 (Scott_French) p:Triage→High
[16:55:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: Ladsgroup)
[16:55:19] <wikibugs>	 (CR) Eevans: [C:+1] aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[16:56:35] <wikibugs>	 (Merged) jenkins-bot: ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: Ladsgroup)
[16:56:44] <wikibugs>	 (CR) JHathaway: [C:+2] provision: Workaround Supermicro BIOS to UEFI bug (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1262196 (https://phabricator.wikimedia.org/T393053) (owner: JHathaway)
[16:56:48] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]]
[16:56:52] <stashbot>	 T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts - https://phabricator.wikimedia.org/T421729
[16:57:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419635)', diff saved to https://phabricator.wikimedia.org/P90549 and previous config saved to /var/cache/conftool/dbconfig/20260413-165721-fceratto.json
[16:57:25] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:57:36] <wikibugs>	 SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815745 (Scott_French)
[16:57:39] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance
[16:57:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T419635)', diff saved to https://phabricator.wikimedia.org/P90550 and previous config saved to /var/cache/conftool/dbconfig/20260413-165747-fceratto.json
[16:58:24] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:58:43] <wikibugs>	 (CR) Scott French: [C:+2] aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[16:59:22] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700)
[17:00:05] <jouncebot>	 ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T1700).
[17:00:27] <swfrench-wmf>	 o/
[17:00:44] <swfrench-wmf>	 I'll be deploying to a handful of non-MediaWiki services during this window
[17:00:45] <wikibugs>	 (PS1) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[17:01:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419635)', diff saved to https://phabricator.wikimedia.org/P90551 and previous config saved to /var/cache/conftool/dbconfig/20260413-170059-fceratto.json
[17:01:41] <wikibugs>	 (Merged) jenkins-bot: aqs2-common: Remove decommed aqs1012 [deployment-charts] - https://gerrit.wikimedia.org/r/1270496 (https://phabricator.wikimedia.org/T423168) (owner: Scott French)
[17:03:09] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[17:03:31] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] (duration: 06m 43s)
[17:03:36] <stashbot>	 T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts - https://phabricator.wikimedia.org/T421729
[17:06:13] <wikibugs>	 SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815811 (Eevans)
[17:06:41] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1017: Security updates
[17:06:41] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[17:06:49] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[17:06:49] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1017: Security updates
[17:07:04] <wikibugs>	 ops-codfw, collaboration-services, DC-Ops: lists2001 has multiple bus errors - https://phabricator.wikimedia.org/T423159#11815840 (Jhancock.wm) a:Ladsgroup→None np!
[17:10:43] <wikibugs>	 (CR) JMeybohm: [C:+1] aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) (owner: CDanis)
[17:11:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P90553 and previous config saved to /var/cache/conftool/dbconfig/20260413-171107-fceratto.json
[17:11:39] <wikibugs>	 (PS2) CDanis: aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486)
[17:11:48] <wikibugs>	 (CR) CDanis: [C:+2] aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) (owner: CDanis)
[17:12:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:13:41] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:57] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[17:14:11] <wikibugs>	 (Merged) jenkins-bot: aux-k8s-services: update Jaeger Istio DestRule [deployment-charts] - https://gerrit.wikimedia.org/r/1270494 (https://phabricator.wikimedia.org/T414486) (owner: CDanis)
[17:15:13] <wikibugs>	 (PS1) Ladsgroup: Revert^6 "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500
[17:17:45] <logmsgbot>	 jhancock@cumin2002 provision (PID 2451662) is awaiting input
[17:18:56] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:19:16] <wikibugs>	 (CR) CDanis: [C:+1] Revert^6 "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 (owner: Ladsgroup)
[17:19:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:19:57] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[17:20:18] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[17:21:16] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P90554 and previous config saved to /var/cache/conftool/dbconfig/20260413-172115-fceratto.json
[17:21:31] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:21:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 (owner: Ladsgroup)
[17:22:57] <wikibugs>	 (Merged) jenkins-bot: Revert^6 "Use envoy for swift inside mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270500 (owner: Ladsgroup)
[17:23:12] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1270500|Revert^6 "Use envoy for swift inside mediawiki"]]
[17:24:49] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1270500|Revert^6 "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:26:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Incident Severity 3: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11815922 (MLechvien-WMF)
[17:26:52] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[7010-7016].magru.wmnet} and A:cp - 9.2.13 Upgrade ()
[17:26:57] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[17:26:58] <wikibugs>	 ops-codfw, DC-Ops: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175 (Jhancock.wm) NEW
[17:27:45] <wikibugs>	 ops-codfw, DC-Ops, ServiceOps new: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175#11815952 (Jhancock.wm)
[17:29:25] <wikibugs>	 SRE, Patch-For-Review: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11815965 (Scott_French) Plot twist:  Deploying https://gerrit.wikimedia.org/r/1270496 to editor-analytics staging failed, again with a (initial) liveness check timeou...
[17:29:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[7002-7008].magru.wmnet} and A:cp - 9.2.13 Upgrade ()
[17:30:43] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270500|Revert^6 "Use envoy for swift inside mediawiki"]] (duration: 07m 31s)
[17:31:19] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:31:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419635)', diff saved to https://phabricator.wikimedia.org/P90555 and previous config saved to /var/cache/conftool/dbconfig/20260413-173123-fceratto.json
[17:31:27] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[17:31:41] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance
[17:31:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T419635)', diff saved to https://phabricator.wikimedia.org/P90556 and previous config saved to /var/cache/conftool/dbconfig/20260413-173148-fceratto.json
[17:32:12] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:33:12] <Amir1>	 !log dropping templatelinks and pagelinks on testcommonswiki core db (T421914)
[17:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:15] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[17:33:33] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[17:33:35] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1017: Security updates
[17:33:35] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[17:33:49] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[17:33:49] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1017: Security updates
[17:34:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[17:35:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419635)', diff saved to https://phabricator.wikimedia.org/P90558 and previous config saved to /var/cache/conftool/dbconfig/20260413-173501-fceratto.json
[17:35:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:35:37] <wikibugs>	 (CR) Jdlrobson: "recheck" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: Urbanecm)
[17:36:40] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[17:37:06] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[17:39:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:40:19] <swfrench-wmf>	 !log applied latent external-services network policy changes for aqs{1023,1024} - T423168
[17:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:22] <stashbot>	 T423168: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168
[17:40:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:41:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:41:44] <jinxer-wm>	 Deployment aqs-http-gateway-main in editor-analytics at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=editor-analytics&var-deployment=aqs-http-gateway-main - ...
[17:41:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:41:51] <swfrench-wmf>	 \o/
[17:42:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:42:53] <wikibugs>	 (CR) Michael Große: "We can probably abandon this. It is for -wmf.22, and -wmf.23 has already been rolled out to all wikis last week." [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: Urbanecm)
[17:44:40] <wikibugs>	 (PS2) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[17:44:55] <wikibugs>	 ops-codfw, DC-Ops, ServiceOps new: wikikube-worker2188 bus errors - https://phabricator.wikimedia.org/T423177 (Jhancock.wm) NEW
[17:45:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P90559 and previous config saved to /var/cache/conftool/dbconfig/20260413-174509-fceratto.json
[17:45:10] <wikibugs>	 (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[17:45:41] <wikibugs>	 SRE: aqs-http-gateway services at risk from defunct hosts in cassandra_hosts - https://phabricator.wikimedia.org/T423168#11816149 (Scott_French) So, once the external-services network policy changes were applied, the crash-looping pod in editor-analytics was able to start successfully.  That means the `i/o t...
[17:46:13] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[17:46:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[17:47:06] <wikibugs>	 (PS3) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[17:47:35] <wikibugs>	 (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[17:49:49] <wikibugs>	 SRE-SLO, ServiceOps new, Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11816171 (MLechvien-WMF)
[17:51:13] <wikibugs>	 SRE: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816179 (Scott_French)
[17:51:19] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[17:52:37] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_ulsfo - 9.2.13 Upgrade ()
[17:52:51] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_ulsfo - 9.2.13 Upgrade ()
[17:53:21] <wikibugs>	 (PS4) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[17:55:18] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P90560 and previous config saved to /var/cache/conftool/dbconfig/20260413-175517-fceratto.json
[17:55:47] <wikibugs>	 (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[18:01:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:02:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:02:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:03:00] <wikibugs>	 (PS5) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[18:03:35] <wikibugs>	 ops-codfw, DC-Ops: sretest2001 has broken psu - https://phabricator.wikimedia.org/T423179 (Jhancock.wm) NEW
[18:04:04] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.depool depool pc1018: Security updates
[18:04:04] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[18:04:12] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[18:04:12] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1018: Security updates
[18:05:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:05:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419635)', diff saved to https://phabricator.wikimedia.org/P90562 and previous config saved to /var/cache/conftool/dbconfig/20260413-180525-fceratto.json
[18:05:26] <wikibugs>	 (PS1) Bking: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860)
[18:05:29] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[18:05:34] <wikibugs>	 (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[18:05:43] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance
[18:05:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T419635)', diff saved to https://phabricator.wikimedia.org/P90563 and previous config saved to /var/cache/conftool/dbconfig/20260413-180551-fceratto.json
[18:05:54] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[18:05:59] <wikibugs>	 (CR) CI reject: [V:-1] opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[18:06:32] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1269649 (owner: Dzahn)
[18:06:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:07:17] <wikibugs>	 (PS2) Bking: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860)
[18:07:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:07:46] <wikibugs>	 (CR) CI reject: [V:-1] opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[18:09:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419635)', diff saved to https://phabricator.wikimedia.org/P90564 and previous config saved to /var/cache/conftool/dbconfig/20260413-180902-fceratto.json
[18:09:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:10:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:10:39] <wikibugs>	 (PS3) Bking: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860)
[18:11:11] <wikibugs>	 ops-codfw, DC-Ops: sretest2001 has broken psu - https://phabricator.wikimedia.org/T423179#11816305 (Jhancock.wm)
[18:11:15] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[18:16:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:16:49] <wikibugs>	 (CR) Zabe: "The point is that currently testcommons sees itself as a client of commons since we set nothing else. But we set x1 as a virtual domain. I" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: Zabe)
[18:16:54] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[18:17:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:19:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P90565 and previous config saved to /var/cache/conftool/dbconfig/20260413-181911-fceratto.json
[18:19:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:20:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:22:23] <wikibugs>	 ops-codfw, Data-Persistence-Misc, DC-Ops: db2201 broken DIMM - https://phabricator.wikimedia.org/T423184 (Jhancock.wm) NEW
[18:22:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:26:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:27:22] <wikibugs>	 (PS1) Zabe: Start reading from new file tables on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548)
[18:27:38] <wikibugs>	 (CR) Zabe: [C:-2] Start reading from new file tables on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[18:27:54] <zabe>	 jouncebot: nowandnext
[18:27:55] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 32 minute(s)
[18:27:55] <jouncebot>	 In 1 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2000)
[18:28:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:29:07] <wikibugs>	 (CR) Zabe: [C:+2] NewFilesPager: Make sure filerevision is queried before file [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270068 (https://phabricator.wikimedia.org/T422946) (owner: Zabe)
[18:29:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P90566 and previous config saved to /var/cache/conftool/dbconfig/20260413-182919-fceratto.json
[18:30:25] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.pool pool pc1018: Security updates
[18:30:26] <logmsgbot>	 !log root@cumin1003 START - Cookbook sre.mysql.parsercache
[18:30:39] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[18:30:39] <logmsgbot>	 !log root@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1018: Security updates
[18:33:07] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280)
[18:36:55] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_ulsfo - 9.2.13 Upgrade ()
[18:37:02] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_ulsfo - 9.2.13 Upgrade ()
[18:38:54] <wikibugs>	 (Merged) jenkins-bot: NewFilesPager: Make sure filerevision is queried before file [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270068 (https://phabricator.wikimedia.org/T422946) (owner: Zabe)
[18:39:28] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419635)', diff saved to https://phabricator.wikimedia.org/P90568 and previous config saved to /var/cache/conftool/dbconfig/20260413-183927-fceratto.json
[18:39:36] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[18:39:45] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance
[18:39:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T419635)', diff saved to https://phabricator.wikimedia.org/P90569 and previous config saved to /var/cache/conftool/dbconfig/20260413-183953-fceratto.json
[18:40:21] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1270068|NewFilesPager: Make sure filerevision is queried before file (T422946)]]
[18:40:25] <stashbot>	 T422946: Expectation (readQueryTime <= 5) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T422946
[18:41:56] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1270068|NewFilesPager: Make sure filerevision is queried before file (T422946)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:43:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419635)', diff saved to https://phabricator.wikimedia.org/P90570 and previous config saved to /var/cache/conftool/dbconfig/20260413-184305-fceratto.json
[18:44:16] <logmsgbot>	 !log zabe@deploy1003 Sync cancelled.
[18:45:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_drmrs - 9.2.13 Upgrade ()
[18:45:10] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_drmrs - 9.2.13 Upgrade ()
[18:46:46] <wikibugs>	 (CR) JHathaway: [C:+2] nftables: cleanup tests [puppet] - https://gerrit.wikimedia.org/r/1261497 (owner: JHathaway)
[18:47:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11816490 (Jclark-ctr) @BTullis  the drive has arrived when can it be replaced?
[18:49:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11816505 (Jclark-ctr) @klausman did you need us to order the drive?
[18:51:10] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply
[18:51:19] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[18:51:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply
[18:51:41] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[18:51:42] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[18:52:16] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[18:52:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[18:52:32] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[18:52:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply
[18:53:01] <rzl>	 is that a whiff of charlie I detect in the air?
[18:53:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P90571 and previous config saved to /var/cache/conftool/dbconfig/20260413-185314-fceratto.json
[18:53:24] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[18:53:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply
[18:53:44] <swfrench-wmf>	 rzl: ha, alas all shell loop
[18:54:04] <rzl>	 haha oh well
[18:54:44] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply
[18:55:05] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[18:59:54] <wikibugs>	 (PS1) Zabe: Revert "NewFilesPager: Make sure filerevision is queried before file" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270517
[19:00:07] <wikibugs>	 (CR) Zabe: [V:+2 C:+2] Revert "NewFilesPager: Make sure filerevision is queried before file" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270517 (owner: Zabe)
[19:00:49] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1270517|Revert "NewFilesPager: Make sure filerevision is queried before file"]]
[19:02:26] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1270517|Revert "NewFilesPager: Make sure filerevision is queried before file"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:02:52] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[19:03:23] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P90572 and previous config saved to /var/cache/conftool/dbconfig/20260413-190322-fceratto.json
[19:04:51] <swfrench-wmf>	 FYI, I'll be applying some pending diffs from https://gerrit.wikimedia.org/r/1270496 to the production equivalents of the staging services updated above ^
[19:05:00] <wikibugs>	 (PS1) Arlolra: Deploy PRV to 4 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1270518 (https://phabricator.wikimedia.org/T423188)
[19:06:40] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270517|Revert "NewFilesPager: Make sure filerevision is queried before file"]] (duration: 05m 51s)
[19:07:27] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[19:07:52] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[19:08:23] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[19:08:46] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[19:09:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[19:09:38] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[19:10:10] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[19:10:28] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[19:10:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[19:11:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[19:11:42] <wikibugs>	 (PS6) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[19:11:51] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[19:12:08] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[19:12:27] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[19:13:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419635)', diff saved to https://phabricator.wikimedia.org/P90573 and previous config saved to /var/cache/conftool/dbconfig/20260413-191330-fceratto.json
[19:13:35] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[19:13:48] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance
[19:13:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90574 and previous config saved to /var/cache/conftool/dbconfig/20260413-191355-fceratto.json
[19:14:08] <wikibugs>	 (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[19:14:52] <swfrench-wmf>	 !log applied aqs cassandra host list changes from https://gerrit.wikimedia.org/r/1270496 - T423168
[19:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:56] <stashbot>	 T423168: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168
[19:17:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90575 and previous config saved to /var/cache/conftool/dbconfig/20260413-191707-fceratto.json
[19:19:34] <wikibugs>	 (PS7) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[19:19:42] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[19:22:05] <wikibugs>	 (CR) CI reject: [V:-1] ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[19:24:07] <wikibugs>	 SRE: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816600 (Scott_French)
[19:25:11] <cdanis>	 !log 💙cdanis@apt1002.wikimedia.org ~ 🕞🍵 sudo -i reprepro --component main --restrict cidergrinder update trixie-wikimedia
[19:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:18] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P90576 and previous config saved to /var/cache/conftool/dbconfig/20260413-192715-fceratto.json
[19:28:35] <wikibugs>	 (PS8) Andrew Bogott: ceph/radosgw: enable static website creation [puppet] - https://gerrit.wikimedia.org/r/1270497
[19:33:05] <wikibugs>	 SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816623 (Scott_French) p:High→Medium @Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up?  Basically, once the new host reaches t...
[19:34:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:35:02] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1001.eqiad.wmnet
[19:35:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:35:23] <wikibugs>	 SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816626 (Scott_French)
[19:35:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:36:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:36:59] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_drmrs - 9.2.13 Upgrade ()
[19:37:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P90577 and previous config saved to /var/cache/conftool/dbconfig/20260413-193726-fceratto.json
[19:38:34] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270497 (owner: Andrew Bogott)
[19:39:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:39:24] <wikibugs>	 (CR) JHathaway: P:base: Make nftables::set resources always defined (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1266205 (owner: Majavah)
[19:39:49] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_drmrs - 9.2.13 Upgrade ()
[19:41:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:42:02] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet
[19:42:16] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1002.eqiad.wmnet
[19:45:34] <wikibugs>	 (PS1) Andrew Bogott: ceph config: remove defaults for some optional args [puppet] - https://gerrit.wikimedia.org/r/1270542
[19:45:44] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1270542 (owner: Andrew Bogott)
[19:45:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[3066,3068-3073].esams.wmnet} and A:cp - 9.2.13 Upgrade ()
[19:46:16] <wikibugs>	 (CR) CI reject: [V:-1] ceph config: remove defaults for some optional args [puppet] - https://gerrit.wikimedia.org/r/1270542 (owner: Andrew Bogott)
[19:46:50] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[3075-3081].esams.wmnet} and A:cp - 9.2.13 Upgrade ()
[19:47:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419635)', diff saved to https://phabricator.wikimedia.org/P90578 and previous config saved to /var/cache/conftool/dbconfig/20260413-194734-fceratto.json
[19:47:37] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[19:47:51] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance
[19:47:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T419635)', diff saved to https://phabricator.wikimedia.org/P90579 and previous config saved to /var/cache/conftool/dbconfig/20260413-194759-fceratto.json
[19:49:10] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet
[19:49:29] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1003.eqiad.wmnet
[19:50:25] <wikibugs>	 (Abandoned) Andrew Bogott: ceph config: remove defaults for some optional args [puppet] - https://gerrit.wikimedia.org/r/1270542 (owner: Andrew Bogott)
[19:51:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419635)', diff saved to https://phabricator.wikimedia.org/P90580 and previous config saved to /var/cache/conftool/dbconfig/20260413-195113-fceratto.json
[19:56:29] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet
[19:58:04] <wikibugs>	 (PS1) Ottomata: mw-page-html-content-change-enrich-next - try sync mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270548 (https://phabricator.wikimedia.org/T421216)
[19:59:30] <wikibugs>	 (PS2) Ottomata: mw-page-html-content-change-enrich-next - try sync mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270548 (https://phabricator.wikimedia.org/T421216)
[20:00:00] <wikibugs>	 (CR) Ottomata: [V:+2 C:+2] mw-page-html-content-change-enrich-next - try sync mode [deployment-charts] - https://gerrit.wikimedia.org/r/1270548 (https://phabricator.wikimedia.org/T421216) (owner: Ottomata)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:01:02] <logmsgbot>	 !log andrewtavis-wmde@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[20:01:04] <logmsgbot>	 !log andrewtavis-wmde@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[20:01:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P90581 and previous config saved to /var/cache/conftool/dbconfig/20260413-200122-fceratto.json
[20:02:25] <wikibugs>	 ops-codfw, Data-Persistence-Misc, DC-Ops: move es2036 - https://phabricator.wikimedia.org/T423195 (Jhancock.wm) NEW
[20:05:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[20:07:01] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:07:05] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:10:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[20:11:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P90582 and previous config saved to /var/cache/conftool/dbconfig/20260413-201130-fceratto.json
[20:21:37] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419635)', diff saved to https://phabricator.wikimedia.org/P90583 and previous config saved to /var/cache/conftool/dbconfig/20260413-202137-fceratto.json
[20:21:41] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[20:21:53] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2195.codfw.wmnet with reason: Maintenance
[20:22:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T419635)', diff saved to https://phabricator.wikimedia.org/P90584 and previous config saved to /var/cache/conftool/dbconfig/20260413-202201-fceratto.json
[20:22:53] <wikibugs>	 SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816804 (Eevans) >>! In T423168#11816624, @Scott_French wrote: > @Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up? >  > Basically, onc...
[20:23:22] <wikibugs>	 (CR) Dzahn: [C:+2] admin: add backup yubikey to myself, dzahn [puppet] - https://gerrit.wikimedia.org/r/1269649 (owner: Dzahn)
[20:25:07] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419635)', diff saved to https://phabricator.wikimedia.org/P90585 and previous config saved to /var/cache/conftool/dbconfig/20260413-202506-fceratto.json
[20:25:41] <wikibugs>	 (CR) Dzahn: [C:+2] "with out-of-band verification" [puppet] - https://gerrit.wikimedia.org/r/1269649 (owner: Dzahn)
[20:27:29] <wikibugs>	 (CR) Eevans: [C:+2] aqs1025: assign aqs role & configure [puppet] - https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[20:28:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[3075-3081].esams.wmnet} and A:cp - 9.2.13 Upgrade ()
[20:31:22] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[3066,3068-3073].esams.wmnet} and A:cp - 9.2.13 Upgrade ()
[20:35:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P90586 and previous config saved to /var/cache/conftool/dbconfig/20260413-203514-fceratto.json
[20:38:06] <wikibugs>	 SRE, Cassandra: aqs-http-gateway services at risk due to inaccessible cassandra hosts - https://phabricator.wikimedia.org/T423168#11816863 (Scott_French) >>! In T423168#11816804, @Eevans wrote: > [...] > These almost always occur in batches (i.e. hardware refreshes, expansions, etc), usually on the order...
[20:40:04] <wikibugs>	 SRE, DNS, Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T423199 (JKelsoteel-WMF) NEW
[20:40:10] <wikibugs>	 (PS4) Cwhite: opensearch: hack around upstream 2.x+ packages [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[20:41:48] <wikibugs>	 (PS1) Kamila Součková: Revert "shellbox: Setup shellbox-icu72" [deployment-charts] - https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546)
[20:41:48] <wikibugs>	 (PS3) Ryan Kemper: growthbook: Add automation API key placeholders [deployment-charts] - https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696)
[20:41:48] <wikibugs>	 (PS1) Ryan Kemper: growthbook: Fix env var indent in job template [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691)
[20:41:50] <wikibugs>	 (PS1) Ryan Kemper: growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696)
[20:42:53] <wikibugs>	 (CR) Cwhite: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1270511 (https://phabricator.wikimedia.org/T422860) (owner: Bking)
[20:45:23] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P90587 and previous config saved to /var/cache/conftool/dbconfig/20260413-204523-fceratto.json
[20:46:13] <wikibugs>	 (CR) Scott French: [C:+1] Revert "Enable $wgTempCategoryCollations for testwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/1270470 (https://phabricator.wikimedia.org/T422546) (owner: Kamila Součková)
[20:47:04] <wikibugs>	 (CR) Scott French: [C:+1] Revert "Temporarily add shellbox-icu to $wgShellboxUrls" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270472 (https://phabricator.wikimedia.org/T422546) (owner: Kamila Součková)
[20:48:30] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[20:50:40] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:53:32] <wikibugs>	 (CR) Scott French: Revert "shellbox: Setup shellbox-icu72" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1270557 (https://phabricator.wikimedia.org/T422546) (owner: Kamila Součková)
[20:53:42] <wikibugs>	 SRE, DNS, Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T423199#11816927 (BCornwall) Open→In progress a:BCornwall
[20:55:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419635)', diff saved to https://phabricator.wikimedia.org/P90588 and previous config saved to /var/cache/conftool/dbconfig/20260413-205531-fceratto.json
[20:55:35] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[20:55:38] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance
[20:55:40] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:56:16] <wikibugs>	 (PS1) BCornwall: wikimedia.org: Add TXT verification for Miro [dns] - https://gerrit.wikimedia.org/r/1270568 (https://phabricator.wikimedia.org/T423199)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2100).
[21:06:47] <wikibugs>	 (CR) Ssingh: [C:+1] wikimedia.org: Add TXT verification for Miro [dns] - https://gerrit.wikimedia.org/r/1270568 (https://phabricator.wikimedia.org/T423199) (owner: BCornwall)
[21:08:54] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1025.eqiad.wmnet with reason: Bootstrapping — T412830
[21:08:58] <stashbot>	 T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830
[21:13:10] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_eqsin - 9.2.13 Upgrade ()
[21:13:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_eqsin - 9.2.13 Upgrade ()
[21:13:27] <wikibugs>	 (PS2) Bodhisattwa: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567
[21:13:41] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:49] <wikibugs>	 (PS1) Mstyles: Route email confirmation funnel through Test Kitchen experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007)
[21:15:40] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:15:59] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[21:16:07] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T410589)', diff saved to https://phabricator.wikimedia.org/P90589 and previous config saved to /var/cache/conftool/dbconfig/20260413-211606-ladsgroup.json
[21:16:12] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[21:17:22] <wikibugs>	 (CR) C. Scott Ananian: [C:+1] Deploy PRV to 4 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1270518 (https://phabricator.wikimedia.org/T423188) (owner: Arlolra)
[21:20:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1270571 (https://phabricator.wikimedia.org/T420007) (owner: Mstyles)
[21:20:59] <wikibugs>	 (CR) Andrew Bogott: [C:+2] floating_ip_updater: use project name (not id) for ptr records [puppet] - https://gerrit.wikimedia.org/r/1264738 (https://phabricator.wikimedia.org/T421739) (owner: Andrew Bogott)
[21:23:13] <sbassett>	 Hey all - I have a couple of sec patches I’d like to get out today.
[21:24:32] <wikibugs>	 (CR) Jon Harald Søby: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[21:33:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:34:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:36:03] <wikibugs>	 (PS3) Bodhisattwa: Enable PageImages extenstions to NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567
[21:38:44] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler)
[21:40:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:41:17] <sbassett>	 !log Deployed security patch for T418533
[21:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:11] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro - https://phabricator.wikimedia.org/T423199#11817142 (Dzahn)
[21:44:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:44:49] <logmsgbot>	 !log sbassett@deploy1003 Started scap sync-world: Deployed security fix for T422085
[21:47:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:50:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:51:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:52:13] <swfrench-wmf>	 FYI, I'm going to be applying some pending external-services network policy changes in the background
[21:52:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:52:57] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[21:53:58] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[21:54:18] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[21:55:02] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[21:55:14] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=1) Rolling upgrade of ATS on A:cp-text_eqsin - 9.2.13 Upgrade ()
[21:55:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[21:56:32] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[22:02:32] <wikibugs>	 (CR) BCornwall: [C:+2] wikimedia.org: Add TXT verification for Miro [dns] - https://gerrit.wikimedia.org/r/1270568 (https://phabricator.wikimedia.org/T423199) (owner: BCornwall)
[22:02:48] <wikibugs>	 (PS1) Dzahn: add fake keys for new zuul to connect to gerrit [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895)
[22:02:50] <logmsgbot>	 !log brett@dns1006 START - running authdns-update
[22:02:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[22:03:18] <wikibugs>	 (PS2) Dzahn: add fake keys for new zuul to connect to gerrit [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895)
[22:03:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[22:04:03] <wikibugs>	 (CR) Dzahn: [V:+2 C:+2] "not-labs-not-private in labs/private" [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) (owner: Dzahn)
[22:04:05] <swfrench-wmf>	 !log applied pending external-services network policy diffs for aqs1025 in wikikube clusters
[22:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:12] <logmsgbot>	 !log brett@dns1006 END - running authdns-update
[22:06:06] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro - https://phabricator.wikimedia.org/T423199#11817200 (BCornwall) Hi, @JKelsoteel-WMF ! This has been deployed - I'm going to go ahead and close this; Please do re-open if something...
[22:06:12] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro - https://phabricator.wikimedia.org/T423199#11817203 (BCornwall) In progress→Resolved
[22:08:44] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_eqsin - 9.2.13 Upgrade ()
[22:08:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp[5023-5024].eqsin.wmnet} and A:cp - 9.2.13 Upgrade ()
[22:11:40] <wikibugs>	 (CR) Bodhisattwa: "thanks for the correction, its now restored" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[22:15:03] <logmsgbot>	 !log sbassett@deploy1003 Finished scap sync-world: Deployed security fix for T422085 (duration: 30m 14s)
[22:15:46] <sbassett>	 …and done
[22:23:07] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp[5023-5024].eqsin.wmnet} and A:cp - 9.2.13 Upgrade ()
[22:24:37] <wikibugs>	 (PS1) Dzahn: zuul: make gerrit ssh key configurable in Hiera and add it [puppet] - https://gerrit.wikimedia.org/r/1270580 (https://phabricator.wikimedia.org/T422895)
[22:26:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-text_codfw - 9.2.13 Upgrade ()
[22:26:09] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-upload_codfw - 9.2.13 Upgrade ()
[22:27:51] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11817242 (Ladsgroup) Databases now have a centralized depool and repool cookbook that encapsulates all the different ways you need to depool and repool db hosts (for different...
[22:29:50] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5023.*
[22:29:54] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5024.*
[22:35:19] <wikibugs>	 (CR) ArielGlenn: [C:+1] rest gateway: handle percent-escaped pipes in query params [deployment-charts] - https://gerrit.wikimedia.org/r/1270514 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler)
[22:36:14] <wikibugs>	 (PS1) Brian Wolff: Record file usage from TemplateStyles pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1270583 (https://phabricator.wikimedia.org/T413707)
[22:50:03] <wikibugs>	 (PS1) Cwhite: logging: add ocsp secret [labs/private] - https://gerrit.wikimedia.org/r/1270586 (https://phabricator.wikimedia.org/T350516)
[22:51:01] <wikibugs>	 (CR) Cwhite: [V:+2 C:+2] logging: add ocsp secret [labs/private] - https://gerrit.wikimedia.org/r/1270586 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[22:56:13] <wikibugs>	 (PS1) Cwhite: Revert "logging: add dummy pki "secrets"" [labs/private] - https://gerrit.wikimedia.org/r/1270589
[22:56:51] <wikibugs>	 (CR) Cwhite: [V:+2 C:+2] Revert "logging: add dummy pki "secrets"" [labs/private] - https://gerrit.wikimedia.org/r/1270589 (owner: Cwhite)
[23:00:05] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260413T2300)
[23:00:40] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1025-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:02:05] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-upload_codfw - 9.2.13 Upgrade ()
[23:03:06] <wikibugs>	 (PS1) Cwhite: beta-logs: change root_ocsp_key path to match labs-private [puppet] - https://gerrit.wikimedia.org/r/1270590 (https://phabricator.wikimedia.org/T350516)
[23:03:12] <wikibugs>	 (CR) Cwhite: [C:+2] beta-logs: change root_ocsp_key path to match labs-private [puppet] - https://gerrit.wikimedia.org/r/1270590 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[23:05:28] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-text_codfw - 9.2.13 Upgrade ()
[23:05:40] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs1025-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:09:30] <wikibugs>	 (PS1) Cwhite: beta-logs: change private_cert_base to match labs-private [puppet] - https://gerrit.wikimedia.org/r/1270591 (https://phabricator.wikimedia.org/T350516)
[23:12:13] <wikibugs>	 (CR) Cwhite: [C:+2] beta-logs: change private_cert_base to match labs-private [puppet] - https://gerrit.wikimedia.org/r/1270591 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[23:15:47] <wikibugs>	 (PS1) Bvibber: Enable ReaderExperiments for itwiki, plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173)
[23:17:37] <wikibugs>	 (CR) Eric Gardner: [C:+1] "LGTM – we can talk about backporting tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/1270592 (https://phabricator.wikimedia.org/T423173) (owner: Bvibber)
[23:20:05] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp2057 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:21:05] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp2057 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:25:58] <wikibugs>	 (PS1) Cwhite: beta-logs: add dummy pki "secrets" [puppet] - https://gerrit.wikimedia.org/r/1270593 (https://phabricator.wikimedia.org/T350516)
[23:26:54] <wikibugs>	 (CR) Cwhite: [C:+2] beta-logs: add dummy pki "secrets" [puppet] - https://gerrit.wikimedia.org/r/1270593 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[23:39:14] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1270594
[23:39:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1270594 (owner: TrainBranchBot)
[23:42:57] <wikibugs>	 (PS1) Andrew Bogott: floating_ip_updater: use project name (not id) for ptr records [puppet] - https://gerrit.wikimedia.org/r/1270595 (https://phabricator.wikimedia.org/T421739)
[23:45:07] <wikibugs>	 (CR) Andrew Bogott: [C:+2] floating_ip_updater: use project name (not id) for ptr records [puppet] - https://gerrit.wikimedia.org/r/1270595 (https://phabricator.wikimedia.org/T421739) (owner: Andrew Bogott)
[23:49:06] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool pool db2208: Work done
[23:49:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1270594 (owner: TrainBranchBot)
[23:49:51] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: sync
[23:49:58] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: sync
[23:50:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:51:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:53:56] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: sync
[23:54:03] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: sync
[23:54:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:55:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:59:12] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1270600