[00:20:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:23:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:26:46] FIRING: [5x] RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown
[01:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:09:38] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1296704
[01:09:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1296704 (owner: TrainBranchBot)
[01:11:46] (CR) RLazarus: [C:-1] "Everything we're using in production images should come from either our own APT repo under our control, or from debian upstream -- we can'" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427989) (owner: Jforrester)
[01:13:44] (CR) RLazarus: [C:+2] Drop the abstractwiki-rust-web images, no longer used [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296681 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester)
[01:13:47] (CR) RLazarus: [V:+2 C:+2] Drop the abstractwiki-rust-web images, no longer used [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296681 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester)
[01:20:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:22:15] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1296704 (owner: TrainBranchBot)
[01:23:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:27:24] (PS2) Anzx: jawiki: lift IP caps for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1296703 (https://phabricator.wikimedia.org/T427912)
[01:27:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296703 (https://phabricator.wikimedia.org/T427912) (owner: Anzx)
[01:37:44] (CR) Jforrester: "Thank you!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296681 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester)
[01:47:16] (CR) Jforrester: "Hmm. This is "just" moving the build from CI run-time to baked-in, and this image isn't ever actually used in production (it's for buildin" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427989) (owner: Jforrester)
[02:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:41] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:26:46] FIRING: [5x] RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown
[04:48:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:53:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:16:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:17:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1056: Upgrading es1056.eqiad.wmnet
[05:17:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1056: Upgrading es1056.eqiad.wmnet
[05:18:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS trixie
[05:19:40] urbanecm: sorry I didn’t see your ping. I think that message occurred because wmf.5 wasn’t deployed
[05:33:22] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1056.eqiad.wmnet with reason: host reimage
[05:33:24] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11979903 (ayounsi)
[05:34:19] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11979906 (ayounsi)
[05:34:28] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11979908 (ayounsi)
[05:39:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1056.eqiad.wmnet with reason: host reimage
[05:41:03] (CR) Arnaudb: "good idea, I added @mmuhlenhoff@wikimedia.org as a reviewer" [puppet] - https://gerrit.wikimedia.org/r/1296495 (https://phabricator.wikimedia.org/T420184) (owner: Arnaudb)
[05:55:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1056.eqiad.wmnet with OS trixie
[05:55:56] (PS1) Ayounsi: Loopback filter: allow internal traceroutes [homer/public] - https://gerrit.wikimedia.org/r/1296933 (https://phabricator.wikimedia.org/T348120)
[05:58:41] marostegui@cumin1003 major-upgrade (PID 509014) is awaiting input
[05:59:11] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[05:59:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1056: repool after upgrade
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T0600)
[06:09:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[06:09:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2231: Upgrading db2231.codfw.wmnet
[06:09:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2231: Upgrading db2231.codfw.wmnet
[06:16:59] (PS16) Daniel Kinzler: rest gateway: implement cost-based rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586)
[06:17:30] marostegui@cumin1003 major-upgrade (PID 516893) is awaiting input
[06:19:47] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS trixie
[06:32:54] Puppet, collaboration-services, Gerrit, Infrastructure-Foundations, Patch-For-Review: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11979958 (ABran-WMF) good idea @Dzahn I pinged @MoritzMuehlenhoff on...
[06:36:01] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[06:40:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[06:44:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1056: repool after upgrade
[06:45:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[06:45:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1049: Upgrading es1049.eqiad.wmnet
[06:46:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es2056 to es2 codfw primary T427875', diff saved to https://phabricator.wikimedia.org/P93632 and previous config saved to /var/cache/conftool/dbconfig/20260603-064623-marostegui.json
[06:46:28] T427875: Migrate es2 section to Debian Trixie - https://phabricator.wikimedia.org/T427875
[06:46:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1049: Upgrading es1049.eqiad.wmnet
[06:50:35] marostegui@cumin1003 major-upgrade (PID 520450) is awaiting input
[06:52:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1049.eqiad.wmnet with OS trixie
[06:56:25] (PS1) Anzx: conductwiki: add sitename and logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1296713 (https://phabricator.wikimedia.org/T426984)
[06:56:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296713 (https://phabricator.wikimedia.org/T426984) (owner: Anzx)
[06:57:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2231.codfw.wmnet with OS trixie
[07:00:05] Amir1, urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T0700).
[07:00:05] Msz2001, anzx, and matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:07] o/
[07:00:29] o/
[07:03:07] o/
[07:03:14] I'm ready to deploy
[07:04:42] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) (owner: STran)
[07:04:42] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296517 (https://phabricator.wikimedia.org/T427788) (owner: STran)
[07:04:43] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[07:07:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2231: repool after maintenance
[07:07:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1049.eqiad.wmnet with reason: host reimage
[07:08:46] (Merged) jenkins-bot: Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) (owner: STran)
[07:08:50] (Merged) jenkins-bot: Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296517 (https://phabricator.wikimedia.org/T427788) (owner: STran)
[07:09:02] (CR) Elukey: "I think there are two tasks at the moment:" [software/spicerack] - https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[07:10:25] (CR) Elukey: [C:+1] tegola: Bump staging image to latest [deployment-charts] - https://gerrit.wikimedia.org/r/1295932 (owner: Jgiannelos)
[07:11:11] (CR) Elukey: [C:+2] docker_registry: remove duplicates from registry-homepage-builder.py [puppet] - https://gerrit.wikimedia.org/r/1295371 (https://phabricator.wikimedia.org/T420978) (owner: Elukey)
[07:11:55] Spiderpig aborted deployment with message: "Aborting: git is not clean: /srv/patches". I'll check what's up there
[07:13:46] SRE, SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11980004 (mahmoud.abdelsattar.wmde) Thanks a lot @Dzahn! All the best!
[07:14:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1049.eqiad.wmnet with reason: host reimage
[07:16:32] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1296516|Add a reply-to to Direct Reporting emails (T427788 T427791 T427829)]], [[gerrit:1296517|Add a reply-to to Direct Reporting emails (T427788 T427791 T427829)]]
[07:16:40] T427788: Add user email address as reply-to for direct reporting emails - https://phabricator.wikimedia.org/T427788
[07:16:40] T427791: Show configured destination email address in direct reporting flow - https://phabricator.wikimedia.org/T427791
[07:16:41] T427829: Update direct reporting copy to "community responders" - https://phabricator.wikimedia.org/T427829
[07:16:41] (PS1) Komla Sapaty: profile::toolforge::bastion: add SSH login activity export timer [puppet] - https://gerrit.wikimedia.org/r/1296944
[07:18:43] (CR) CI reject: [V:-1] profile::toolforge::bastion: add SSH login activity export timer [puppet] - https://gerrit.wikimedia.org/r/1296944 (owner: Komla Sapaty)
[07:24:50] (PS1) MVernon: swift: remove 2 drained eqiad backends for reimage [puppet] - https://gerrit.wikimedia.org/r/1296946 (https://phabricator.wikimedia.org/T421719)
[07:26:03] SRE-swift-storage, Ceph, Infrastructure-Foundations: Create new S3 backends for the Docker Registry service - https://phabricator.wikimedia.org/T427175#11980050 (elukey)
[07:26:45] (CR) Marostegui: [C:+1] swift: remove 2 drained eqiad backends for reimage [puppet] - https://gerrit.wikimedia.org/r/1296946 (https://phabricator.wikimedia.org/T421719) (owner: MVernon)
[07:28:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:30:03] (CR) MVernon: [C:+2] swift: remove 2 drained eqiad backends for reimage [puppet] - https://gerrit.wikimedia.org/r/1296946 (https://phabricator.wikimedia.org/T421719) (owner: MVernon)
[07:30:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:31:06] (CR) Brouberol: [C:+1] kafka event platform logs - Strip the stray $!msg field [puppet] - https://gerrit.wikimedia.org/r/1296607 (https://phabricator.wikimedia.org/T291645) (owner: Btullis)
[07:32:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1049.eqiad.wmnet with OS trixie
[07:35:15] !log mszwarc@deploy1003 mszwarc, stran: Backport for [[gerrit:1296516|Add a reply-to to Direct Reporting emails (T427788 T427791 T427829)]], [[gerrit:1296517|Add a reply-to to Direct Reporting emails (T427788 T427791 T427829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:35:21] T427788: Add user email address as reply-to for direct reporting emails - https://phabricator.wikimedia.org/T427788
[07:35:21] T427791: Show configured destination email address in direct reporting flow - https://phabricator.wikimedia.org/T427791
[07:35:22] T427829: Update direct reporting copy to "community responders" - https://phabricator.wikimedia.org/T427829
[07:35:30] (CR) Arnaudb: trafficserver: add a map for gitlab as a backend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb)
[07:35:36] marostegui@cumin1003 major-upgrade (PID 520450) is awaiting input
[07:35:59] !log mszwarc@deploy1003 mszwarc, stran: Continuing with deployment
[07:36:38] (CR) Mszwarc: [C:+2] "Ahead of deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296580 (https://phabricator.wikimedia.org/T427917) (owner: Anzx)
[07:36:48] (CR) Mszwarc: [C:+2] "Ahead of deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296703 (https://phabricator.wikimedia.org/T427912) (owner: Anzx)
[07:37:04] (CR) Mszwarc: [C:+2] "Ahead of deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296713 (https://phabricator.wikimedia.org/T426984) (owner: Anzx)
[07:37:22] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[07:37:35] (Merged) jenkins-bot: Add kha to wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/1296580 (https://phabricator.wikimedia.org/T427917) (owner: Anzx)
[07:37:36] (Abandoned) Komla Sapaty: profile::toolforge::bastion: add SSH login activity export timer [puppet] - https://gerrit.wikimedia.org/r/1296944 (owner: Komla Sapaty)
[07:37:43] (Merged) jenkins-bot: jawiki: lift IP caps for workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1296703 (https://phabricator.wikimedia.org/T427912) (owner: Anzx)
[07:37:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1049: repool after upgrade
[07:38:11] (Merged) jenkins-bot: conductwiki: add sitename and logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1296713 (https://phabricator.wikimedia.org/T426984) (owner: Anzx)
[07:39:27] SRE, Infrastructure-Foundations: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024 (elukey) NEW
[07:40:36] (PS1) Daniel Kinzler: rest gateway: harden Lua against rate-limit bypass [deployment-charts] - https://gerrit.wikimedia.org/r/1296951
[07:41:50] (PS2) Daniel Kinzler: rest gateway: harden Lua against rate-limit bypass [deployment-charts] - https://gerrit.wikimedia.org/r/1296951
[07:42:31] (PS1) Marostegui: wmnet: Update es2-master alias [dns] - https://gerrit.wikimedia.org/r/1296952 (https://phabricator.wikimedia.org/T427875)
[07:42:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1056 to es2 eqiad primary T427875', diff saved to https://phabricator.wikimedia.org/P93637 and previous config saved to /var/cache/conftool/dbconfig/20260603-074250-marostegui.json
[07:42:54] T427875: Migrate es2 section to Debian Trixie - https://phabricator.wikimedia.org/T427875
[07:43:02] matthiasmullie: Can your patches be deployed together with anzx's or do you prefer to do them separately?
[07:43:14] either WFM
[07:43:26] (CR) Marostegui: [C:+2] wmnet: Update es2-master alias [dns] - https://gerrit.wikimedia.org/r/1296952 (https://phabricator.wikimedia.org/T427875) (owner: Marostegui)
[07:43:30] !log marostegui@dns1004 START - running authdns-update
[07:43:41] So I can deploy them together, to spend less time overall
[07:43:52] My deployments will finish soon
[07:44:03] (CR) Bartosz Wójtowicz: linked-artifacts: update for production deploy (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1296683 (https://phabricator.wikimedia.org/T414140) (owner: Eevans)
[07:44:34] (CR) Mszwarc: [C:+2] "Ahead of deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799) (owner: Marco Fossati)
[07:44:48] (CR) Mszwarc: [C:+2] "Ahead of deployment" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296627 (https://phabricator.wikimedia.org/T427821) (owner: Matthias Mullie)
[07:44:56] !log marostegui@dns1004 END - running authdns-update
[07:44:58] (CR) Mszwarc: [C:+2] "Ahead of deployment" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) (owner: Matthias Mullie)
[07:45:22] (CR) Mszwarc: "Let's do this separately as it changes i18n" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) (owner: Matthias Mullie)
[07:46:17] (Merged) jenkins-bot: MultimediaViewer: enable image carousel as a beta feature on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799) (owner: Marco Fossati)
[07:46:27] (PS3) Komla Sapaty: profile::toolforge::bastion: add SSH login activity export timer [puppet] - https://gerrit.wikimedia.org/r/1294864 (https://phabricator.wikimedia.org/T423549)
[07:46:33] (CR) Kosta Harlan: Revert^2 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296635 (https://phabricator.wikimedia.org/T403829) (owner: Kosta Harlan)
[07:46:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296635 (https://phabricator.wikimedia.org/T403829) (owner: Kosta Harlan)
[07:47:01] I’ll deploy my config patch at the end, could whoever goes last please ping me?
[07:47:27] (Merged) jenkins-bot: Add missing lazy img to carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296627 (https://phabricator.wikimedia.org/T427821) (owner: Matthias Mullie)
[07:47:31] (CR) Komla Sapaty: "I have removed the PII." [puppet] - https://gerrit.wikimedia.org/r/1294864 (https://phabricator.wikimedia.org/T423549) (owner: Komla Sapaty)
[07:47:48] kostajh: I can ping you, but there's one patch with i18n changes, so be prepared to wait a while
[07:48:17] Msz2001: ok
[07:48:19] (PS1) Bartosz Wójtowicz: ml-services: Separate REST and gRPC deployments for outlink-topic-model. [deployment-charts] - https://gerrit.wikimedia.org/r/1296953 (https://phabricator.wikimedia.org/T418493)
[07:48:46] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296516|Add a reply-to to Direct Reporting emails (T427788 T427791 T427829)]], [[gerrit:1296517|Add a reply-to to Direct Reporting emails (T427788 T427791 T427829)]] (duration: 32m 13s)
[07:48:52] T427788: Add user email address as reply-to for direct reporting emails - https://phabricator.wikimedia.org/T427788
[07:48:53] T427791: Show configured destination email address in direct reporting flow - https://phabricator.wikimedia.org/T427791
[07:48:53] T427829: Update direct reporting copy to "community responders" - https://phabricator.wikimedia.org/T427829
[07:49:29] (CR) Brouberol: profile::kafka::broker: add ACLs in a file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[07:50:06] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1296580|Add kha to wmgExtraLanguageNames (T427917)]], [[gerrit:1296703|jawiki: lift IP caps for workshop (T427912)]], [[gerrit:1296713|conductwiki: add sitename and logo (T426984 T427541)]], [[gerrit:1296627|Add missing lazy img to carousel (T427821)]], [[gerrit:1295968|MultimediaViewer: enable image carousel as a beta feature on Wikipedias (T426799)]]
[07:50:12] I've started deploying all the remaining changes that don't update interface messages
[07:50:21] T427917: Add monolingual language code kha (khasi language) - https://phabricator.wikimedia.org/T427917
[07:50:22] T427912: Lift IP cap on 4 days in June and July 2026 for Editation for jawiki - https://phabricator.wikimedia.org/T427912
[07:50:22] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984
[07:50:23] T427541: Set proper sitename for electcomwiki - https://phabricator.wikimedia.org/T427541
[07:50:23] T427821: [Image Browsing] Carousel: Missing image with legacy parser - https://phabricator.wikimedia.org/T427821
[07:50:23] T426799: [Image Browsing] Launch image carousel as beta feature - https://phabricator.wikimedia.org/T426799
[07:50:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[07:50:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2054: Upgrading es2054.codfw.wmnet
[07:51:12] dancy: Are we going to have train in the UTC morning or afternoon? I assume afternoon?
[07:51:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2054: Upgrading es2054.codfw.wmnet
[07:52:12] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2054.codfw.wmnet with OS trixie
[07:52:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2231: repool after maintenance
[07:54:14] !log mszwarc@deploy1003 anzx, mlitn, mfossati, mszwarc: Backport for [[gerrit:1296580|Add kha to wmgExtraLanguageNames (T427917)]], [[gerrit:1296703|jawiki: lift IP caps for workshop (T427912)]], [[gerrit:1296713|conductwiki: add sitename and logo (T426984 T427541)]], [[gerrit:1296627|Add missing lazy img to carousel (T427821)]], [[gerrit:1295968|MultimediaViewer: enable image carousel as a beta feature on Wikipedias (T42
[07:54:14] 6799)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:54:19] Msz2001: looking
[07:55:21] matthiasmullie: You can test yours as well (I see you weren't pinged by scap)
[07:55:32] checking
[07:56:03] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11980249 (PantheraLeo1359531) Hi! I think LZW compressing would be very useful. Amir Sarabadani reached out to me about the issue. I would be gl...
[07:56:19] jnuche: I'm assuming the train will roll in the UTC afternoon, so we can continue with backporting?
[07:56:23] Msz2001: mine looks good, ok to sync
[07:56:59] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Reimaging upstream server
[07:57:02] LGTM too
[07:57:08] !log mszwarc@deploy1003 anzx, mlitn, mfossati, mszwarc: Continuing with deployment
[07:57:24] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1022-1023].eqiad.wmnet with reason: Reimaging upstream server
[07:59:29] Having heard no answers from train conductors, I'll assume that the train window in a minute won't happen and that we can finish the remaining deployments
[08:00:03] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:00:04] (which is consistent with yesterday's train which also happened in the afternoon UTC)
[08:00:05] dancy and jnuche: That opportune time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T0800).
[08:00:42] (CR) Mszwarc: [C:+2] "Ahead of deployment" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) (owner: Matthias Mullie)
[08:01:14] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1157: Repooling
[08:01:17] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1157: Repooling
[08:01:25] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:01:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:01:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2215: Upgrading db2215.codfw.wmnet
[08:01:45] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1211: Upgrading db1211.eqiad.wmnet
[08:01:55] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:01:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2215: Upgrading db2215.codfw.wmnet
[08:02:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1211: Upgrading db1211.eqiad.wmnet
[08:03:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2215.codfw.wmnet with OS trixie
[08:03:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[08:03:40] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1211.eqiad.wmnet with OS trixie
[08:03:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T426633)', diff saved to https://phabricator.wikimedia.org/P93643 and previous config saved to /var/cache/conftool/dbconfig/20260603-080346-fceratto.json
[08:03:53] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296580|Add kha to wmgExtraLanguageNames (T427917)]], [[gerrit:1296703|jawiki: lift IP caps for workshop (T427912)]], [[gerrit:1296713|conductwiki: add sitename and logo (T426984 T427541)]], [[gerrit:1296627|Add missing lazy img to carousel (T427821)]], [[gerrit:1295968|MultimediaViewer: enable image carousel as a beta feature on Wikipedias (T426799)]
[08:03:53] ] (duration: 13m 47s)
[08:03:58] Msz2001: thanks for deploying
[08:04:03] T427917: Add monolingual language code kha (khasi language) - https://phabricator.wikimedia.org/T427917
[08:04:04] T427912: Lift IP cap on 4 days in June and July 2026 for Editation for jawiki - https://phabricator.wikimedia.org/T427912
[08:04:04] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984
[08:04:04] T427541: Set proper sitename for electcomwiki - https://phabricator.wikimedia.org/T427541
[08:04:05] T427821: [Image Browsing] Carousel: Missing image with legacy parser - https://phabricator.wikimedia.org/T427821
[08:04:05] T426799: [Image Browsing] Launch image carousel as beta feature - https://phabricator.wikimedia.org/T426799
[08:04:30] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) (owner: Matthias Mullie)
[08:04:48] I have started the next backport, in the meantime I'll purge the updated logos
[08:05:01] (Merged) jenkins-bot: Image Browsing: add accessible labels to carousel elements [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) (owner: Matthias Mullie)
[08:05:09] SRE, ServiceOps new, ServiceOps-Upgrades-Hardware, Patch-For-Review: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11980282 (jijiki)
[08:05:32] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1296632|Image Browsing: add accessible labels to carousel elements (T407793)]]
[08:05:36] T407793: Image Browsing: Ensure carousel meets accessibility standards - https://phabricator.wikimedia.org/T407793
[08:05:53] Thanks, Msz2001 - can skip testing for that last one
[08:06:03] okay
[08:08:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2054.codfw.wmnet with reason: host reimage
[08:11:32] (PS1) Dpogorzelski: ml-serve: tweak llm resource allocation [deployment-charts] - https://gerrit.wikimedia.org/r/1297053
[08:14:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2054.codfw.wmnet with reason: host reimage
[08:15:54] (CR) Gmodena: [C:+1] "Terrific job @trueg@wikimedia.org!" [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: Trueg)
[08:16:42] (CR) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: Trueg)
[08:16:50] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:17:12] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:17:30] (PS16) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007)
[08:17:31] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:17:35] (PS4) Slyngshede: P:cache:haproxy add image generator information [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338)
[08:17:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T426633)', diff saved to https://phabricator.wikimedia.org/P93645 and previous config saved to /var/cache/conftool/dbconfig/20260603-081756-fceratto.json
[08:18:12] (CR) Slyngshede: P:cache:haproxy add image generator information (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede)
[08:18:12] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:18:45] (CR) Bartosz Wójtowicz: [C:+1] ml-serve: tweak llm resource allocation [deployment-charts] - https://gerrit.wikimedia.org/r/1297053 (owner: Dpogorzelski)
[08:18:50] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage
[08:22:45] !log mszwarc@deploy1003 mlitn, mszwarc: Backport for [[gerrit:1296632|Image Browsing: add accessible labels to carousel elements (T407793)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:22:49] T407793: Image Browsing: Ensure carousel meets accessibility standards - https://phabricator.wikimedia.org/T407793 [08:22:55] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2215.codfw.wmnet with reason: host reimage [08:23:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1049: repool after upgrade [08:24:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage [08:25:14] !log mszwarc@deploy1003 mlitn, mszwarc: Continuing with deployment [08:26:31] (03PS1) 10Jakob: Search: Disable redundant search limit validation [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297058 (https://phabricator.wikimedia.org/T427935) [08:26:46] FIRING: [5x] RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [08:27:28] (03CR) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [08:28:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P93647 and previous config saved to /var/cache/conftool/dbconfig/20260603-082804-fceratto.json [08:28:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2215.codfw.wmnet with reason: host reimage [08:29:32] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [08:30:22] !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[08:30:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:31:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2054.codfw.wmnet with OS trixie [08:31:26] !log jiji@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [08:31:55] !log jiji@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:33:15] !log jiji@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:33:42] !log jiji@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:33:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:14] !log jiji@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:34:22] marostegui@cumin1003 major-upgrade (PID 531258) is awaiting input [08:34:51] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: tweak llm resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297053 (owner: 10Dpogorzelski) [08:34:53] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [08:35:03] !log jiji@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[08:35:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2054.codfw.wmnet: After reimage [08:35:50] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool es2054.codfw.wmnet: After reimage [08:36:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2054: repool after upgrade [08:37:43] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296632|Image Browsing: add accessible labels to carousel elements (T407793)]] (duration: 32m 11s) [08:37:46] T407793: Image Browsing: Ensure carousel meets accessibility standards - https://phabricator.wikimedia.org/T407793 [08:37:47] kostajh: You can deploy now [08:38:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P93649 and previous config saved to /var/cache/conftool/dbconfig/20260603-083811-fceratto.json [08:38:21] Msz2001: thanks [08:38:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296635 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [08:39:00] Msz2001: thanks! 
[08:39:05] yw [08:40:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297058 (https://phabricator.wikimedia.org/T427935) (owner: 10Jakob) [08:40:51] (03Merged) 10jenkins-bot: Revert^2 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296635 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [08:41:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1211.eqiad.wmnet with OS trixie [08:41:22] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1296635|Revert^2 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] [08:41:26] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [08:43:27] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1211: Migration of db1211.eqiad.wmnet completed [08:44:18] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [08:45:19] !log jiji@cumin1003 START - Cookbook sre.discovery.service-route check docker-registry: maintenance [08:45:19] !log jiji@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check docker-registry: maintenance [08:45:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2215.codfw.wmnet with OS trixie [08:46:37] (03CR) 10FNegri: [C:03+1] P:toolforge::redis_sentinel: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295908 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [08:47:23] !log kharlan@deploy1003 kharlan: Backport for 
[[gerrit:1296635|Revert^2 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:47:27] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [08:48:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T426633)', diff saved to https://phabricator.wikimedia.org/P93651 and previous config saved to /var/cache/conftool/dbconfig/20260603-084819-fceratto.json [08:48:34] jouncebot: now [08:48:34] For the next 1 hour(s) and 11 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T0800) [08:48:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:48:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T426633)', diff saved to https://phabricator.wikimedia.org/P93652 and previous config saved to /var/cache/conftool/dbconfig/20260603-084846-fceratto.json [08:50:06] (03CR) 10AikoChou: [C:03+1] ml-services: Separate REST and gRPC deployments for outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296953 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [08:50:44] !log kharlan@deploy1003 kharlan: Rolling back deployment [08:50:46] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Separate REST and gRPC deployments for outlink-topic-model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296953 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [08:51:12] (03PS1) 10Kosta Harlan: Revert^3 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297064 [08:51:34] (03PS2) 10Kosta Harlan: Revert^3 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297064 (https://phabricator.wikimedia.org/T403829) [08:51:51] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb[1022-1023].eqiad.wmnet [08:51:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb[1022-1023].eqiad.wmnet [08:52:10] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet [08:52:12] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet [08:52:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2215: Migration of db2215.codfw.wmnet completed [08:52:59] (03Merged) 10jenkins-bot: ml-services: Separate REST and gRPC deployments for outlink-topic-model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296953 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [08:53:03] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:53:05] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296635|Revert^2 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] (duration: 11m 43s) [08:53:09] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [08:53:14] I will swap the redis servers of ratelimit and rest-gateway [08:53:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297064 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [08:54:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:54:37] (03Merged) 10jenkins-bot: Revert^3 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297064 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [08:54:48] (03CR) 10Effie Mouzeli: [C:03+2] ratelimit: replace rdb2009 with rdb2013 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294274 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [08:55:04] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1297064|Revert^3 "hCaptcha: Load self-hosted secure-api.js on
group0 wikis" (T403829)]] [08:55:20] (03PS1) 10Kosta Harlan: Revert^4 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297065 (https://phabricator.wikimedia.org/T403829) [08:58:03] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:58:38] (03PS1) 10Effie Mouzeli: ratelimit: fix typo in redis shards [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297066 [08:59:02] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1297064|Revert^3 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:59:06] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [08:59:38] (03PS1) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [08:59:41] !log kharlan@deploy1003 kharlan: Continuing with deployment [08:59:57] (03PS1) 10Ayounsi: network/data.yaml: add eqiad/codfw per rack public vlans [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) [09:00:04] (03PS2) 10Kosta Harlan: Revert^4 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297065 (https://phabricator.wikimedia.org/T403829) [09:00:15] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11980411 (10jijiki) [09:00:15] !log ayounsi@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "new eqiad/codfw public vlans - ayounsi@cumin1003" [09:00:17] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "new eqiad/codfw public vlans - ayounsi@cumin1003" [09:00:35] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new eqiad/codfw public vlans - ayounsi@cumin1003 - T422043" [09:00:39] T422043: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043 [09:00:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T426633)', diff saved to https://phabricator.wikimedia.org/P93656 and previous config saved to /var/cache/conftool/dbconfig/20260603-090056-fceratto.json [09:01:19] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new eqiad/codfw public vlans - ayounsi@cumin1003 - T422043" [09:01:26] (03CR) 10Effie Mouzeli: [C:03+2] ratelimit: fix typo in redis shards [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297066 (owner: 10Effie Mouzeli) [09:01:28] (03PS1) 10Kosta Harlan: hCaptcha: Collect risk score for blocked account creations [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297069 (https://phabricator.wikimedia.org/T427784) [09:01:59] (03PS2) 10Ayounsi: network/data.yaml: add eqiad/codfw per rack public vlans [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) [09:02:08] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) (owner: 10Ayounsi) [09:03:49] (03Merged) 10jenkins-bot: ratelimit: fix typo in redis shards [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297066 (owner: 10Effie Mouzeli) [09:04:19] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 
'articletopic-outlink' for release 'main' . [09:05:37] (03Abandoned) 10Gkyziridis: wgRestSandboxSpecs: Add LiftWing API OpenAPI specs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294988 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [09:05:46] (03CR) 10Effie Mouzeli: [C:03+2] radioscope: replace rdb2009 with rdb2013 #2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [09:05:54] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [09:05:57] (03CR) 10CI reject: [V:04-1] radioscope: replace rdb2009 with rdb2013 #2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [09:05:59] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297064|Revert^3 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] (duration: 10m 54s) [09:06:02] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [09:06:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297065 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [09:06:20] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [09:06:24] (03PS3) 10Effie Mouzeli: radioscope: replace rdb2009 with rdb2013 #2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) [09:07:24] (03Merged) 10jenkins-bot: Revert^4 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297065 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [09:07:49] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1297065|Revert^4 "hCaptcha: Load self-hosted secure-api.js on group0 
wikis"]] [09:09:14] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425#11980454 (10JMonton-WMF) [09:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:09:46] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1297065|Revert^4 "hCaptcha: Load self-hosted secure-api.js on group0 wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:10:40] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:11:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P93659 and previous config saved to /var/cache/conftool/dbconfig/20260603-091104-fceratto.json [09:14:56] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297065|Revert^4 "hCaptcha: Load self-hosted secure-api.js on group0 wikis"]] (duration: 07m 06s) [09:16:49] 06SRE, 06Traffic: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#11980480 (10SLyngshede-WMF) [09:17:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297069 (https://phabricator.wikimedia.org/T427784) (owner: 10Kosta Harlan) [09:17:55] (03CR) 10Effie Mouzeli: radioscope: replace rdb2009 with rdb2013 #2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [09:18:57] (03CR) 10Cathal Mooney: [C:03+1] "I am probably not the best Python person to review this but I've stepped through it and the logic seems good to me and code is 
well struct" [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [09:19:28] 10SRE-tools, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11980483 (10elukey) I think we could study something that logs to a task if needed, as optional feature to toggle via parameter. We could d... [09:19:33] (03Merged) 10jenkins-bot: hCaptcha: Collect risk score for blocked account creations [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297069 (https://phabricator.wikimedia.org/T427784) (owner: 10Kosta Harlan) [09:19:58] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1297069|hCaptcha: Collect risk score for blocked account creations (T427784)]] [09:20:00] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:20:02] T427784: hCaptcha risk scores for blocked account creations - https://phabricator.wikimedia.org/T427784 [09:20:10] (03Merged) 10jenkins-bot: radioscope: replace rdb2009 with rdb2013 #2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [09:21:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P93661 and previous config saved to /var/cache/conftool/dbconfig/20260603-092111-fceratto.json [09:21:26] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/redioscope: apply [09:21:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2054: repool after upgrade [09:21:38] !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/redioscope: apply [09:21:51] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1297069|hCaptcha: Collect risk score for blocked account creations (T427784)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:22:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:22:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1053: Upgrading es1053.eqiad.wmnet [09:23:14] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:23:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1053: Upgrading es1053.eqiad.wmnet [09:24:22] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add public1-b3-codfw gateway IPs - ayounsi@cumin1003" [09:24:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add public1-b3-codfw gateway IPs - ayounsi@cumin1003" [09:24:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:25:00] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1053.eqiad.wmnet with OS trixie [09:25:47] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [09:27:25] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297069|hCaptcha: Collect risk score for blocked account creations (T427784)]] (duration: 07m 26s) [09:27:28] T427784: hCaptcha risk scores for blocked account creations - https://phabricator.wikimedia.org/T427784 [09:28:42] (03PS1) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297078 (https://phabricator.wikimedia.org/T425624) [09:28:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1211: Migration of db1211.eqiad.wmnet completed [09:28:57] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.mysql.major-upgrade (exit_code=0) [09:30:14] 06SRE, 06Infrastructure-Foundations: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024#11980542 (10elukey) On Trixie: * python3-kafka goes to 2.0.2-9 * python3-kubernetes goes to 30.1.0-2 * python3-mysql goes to 1.4.6-2+b5 * python3-pynetbox goes to 7.4.1-1 * python3-etcd goes to 0.4.5-6... [09:30:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:31:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T426633)', diff saved to https://phabricator.wikimedia.org/P93666 and previous config saved to /var/cache/conftool/dbconfig/20260603-093119-fceratto.json [09:31:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:31:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T426633)', diff saved to https://phabricator.wikimedia.org/P93667 and previous config saved to /var/cache/conftool/dbconfig/20260603-093146-fceratto.json [09:32:09] (03CR) 10JavierMonton: [C:03+2] stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297078 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [09:32:29] (03PS1) 10MVernon: swift: migrate ms-be106[6-7] to new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1297080 (https://phabricator.wikimedia.org/T421719) [09:34:04] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [09:35:01] (03Merged) 10jenkins-bot: stream: webrequest-page-view [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1297078 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [09:35:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:36:57] (03CR) 10Jcrespo: [C:03+1] swift: migrate ms-be106[6-7] to new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1297080 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [09:37:27] (03PS1) 10Jcrespo: backup: Add job ids for read-only backups [puppet] - 10https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661) [09:37:53] (03CR) 10MVernon: [C:03+2] swift: migrate ms-be106[6-7] to new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1297080 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [09:38:14] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2013.codfw.wmnet [09:38:26] (03PS6) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) [09:38:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2215: Migration of db2215.codfw.wmnet completed [09:38:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:38:50] (03CR) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [09:40:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T426633)', diff saved to https://phabricator.wikimedia.org/P93669 and previous config saved to 
/var/cache/conftool/dbconfig/20260603-094014-fceratto.json [09:41:27] 10SRE-tools, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11980580 (10Volans) Would adding the duration to the existing log messages in [1], [2] and [3] be enough? [1] https://gerrit.wikimedia.org... [09:41:36] (03PS1) 10Brouberol: dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label [puppet] - 10https://gerrit.wikimedia.org/r/1297083 (https://phabricator.wikimedia.org/T425653) [09:41:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1053.eqiad.wmnet with reason: host reimage [09:41:53] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on es1053.eqiad.wmnet with reason: host reimage [09:42:17] (03PS2) 10Brouberol: dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label [puppet] - 10https://gerrit.wikimedia.org/r/1297083 (https://phabricator.wikimedia.org/T425653) [09:42:52] (03PS3) 10Brouberol: dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label [puppet] - 10https://gerrit.wikimedia.org/r/1297083 (https://phabricator.wikimedia.org/T425653) [09:43:29] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8633/co" [puppet] - 10https://gerrit.wikimedia.org/r/1297083 (https://phabricator.wikimedia.org/T425653) (owner: 10Brouberol) [09:43:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2013.codfw.wmnet [09:44:00] (03CR) 10Btullis: [C:03+1] "Great, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297083 (https://phabricator.wikimedia.org/T425653) (owner: 10Brouberol) [09:44:11] (03CR) 10Brouberol: [C:03+2] dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label [puppet] - 10https://gerrit.wikimedia.org/r/1297083 (https://phabricator.wikimedia.org/T425653) (owner: 10Brouberol) [09:44:16] (03CR) 10Btullis: [C:03+2] kafka event platform logs - Strip the stray $!msg field [puppet] - 10https://gerrit.wikimedia.org/r/1296607 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [09:46:35] (03PS4) 10Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) [09:46:50] (03CR) 10Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [09:47:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:48:07] (03CR) 10Cathal Mooney: [C:03+1] network/data.yaml: add eqiad/codfw per rack public vlans [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) (owner: 10Ayounsi) [09:48:35] (03CR) 10Ayounsi: [C:03+1] netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [09:48:50] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1053.eqiad.wmnet with OS trixie [09:49:24] (03PS3) 10Ayounsi: network/data.yaml: add eqiad/codfw per rack public vlans [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) [09:49:35] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:49:41] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:50:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P93670 and previous config saved to /var/cache/conftool/dbconfig/20260603-095022-fceratto.json [09:51:05] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [09:51:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:51:11] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [09:51:24] 10SRE-tools, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11980630 (10CWilliams-WMF) @elukey yes, I did have an idea... but @Volans suggesting that making it part of the log messages from the calls... 
[09:52:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:52:06] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [09:52:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:52:34] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [09:52:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1053: repool after upgrade [09:53:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:53:08] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:53:21] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:53:35] (03CR) 10Cathal Mooney: [C:03+1] network/data.yaml: add eqiad/codfw per rack public vlans [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) (owner: 10Ayounsi) [09:55:11] 10SRE-Access-Requests: Rotating production SSH-Key for @Michael to a Yubikey-based one - https://phabricator.wikimedia.org/T428037 (10Michael) 03NEW [09:55:24] (03PS1) 10Effie Mouzeli: mediawiki-common: add rdb2013 and rdb2014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297087 (https://phabricator.wikimedia.org/T418924) [09:56:21] 10SRE-tools, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Provide downtime duration information in sre.mysql cookbooks - https://phabricator.wikimedia.org/T427780#11980650 (10Marostegui) I think that'd be good for me yeah! 
Thank you all [09:56:52] (03CR) 10Ayounsi: [C:03+2] network/data.yaml: add eqiad/codfw per rack public vlans [puppet] - 10https://gerrit.wikimedia.org/r/1297068 (https://phabricator.wikimedia.org/T422043) (owner: 10Ayounsi) [09:56:56] (03CR) 10Majavah: [C:03+2] P:toolforge::redis_sentinel: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295908 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [09:57:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:57:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2196: Upgrading db2196.codfw.wmnet [09:57:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2196: Upgrading db2196.codfw.wmnet [09:59:09] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2196.codfw.wmnet with OS trixie [09:59:58] (03PS1) 10Effie Mouzeli: rdb2013: use nftables [puppet] - 10https://gerrit.wikimedia.org/r/1297088 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1000) [10:00:28] (03PS2) 10Effie Mouzeli: rdb2013: use nftables [puppet] - 10https://gerrit.wikimedia.org/r/1297088 [10:00:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P93673 and previous config saved to /var/cache/conftool/dbconfig/20260603-100029-fceratto.json [10:00:36] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297088 (owner: 10Effie Mouzeli) [10:03:17] (03PS1) 10Brouberol: dse-k8s: deploy the ceph-csi pods on dedicated workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297089 (https://phabricator.wikimedia.org/T428036) [10:03:25] (03CR) 10Joal: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 
10Elukey) [10:03:43] (03CR) 10Effie Mouzeli: [C:03+2] rdb2013: use nftables [puppet] - 10https://gerrit.wikimedia.org/r/1297088 (owner: 10Effie Mouzeli) [10:03:47] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for MobileFrontend on most group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297090 (https://phabricator.wikimedia.org/T425940) [10:09:49] (03CR) 10Brouberol: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [10:10:20] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2013.codfw.wmnet [10:10:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T426633)', diff saved to https://phabricator.wikimedia.org/P93675 and previous config saved to /var/cache/conftool/dbconfig/20260603-101037-fceratto.json [10:10:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:11:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T426633)', diff saved to https://phabricator.wikimedia.org/P93676 and previous config saved to /var/cache/conftool/dbconfig/20260603-101105-fceratto.json [10:11:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:11:31] RESOLVED: [5x] RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [10:14:00] (03CR) 10Btullis: [C:03+1] "Nice. Many thanks for taking care of this." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297089 (https://phabricator.wikimedia.org/T428036) (owner: 10Brouberol) [10:15:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2196.codfw.wmnet with reason: host reimage [10:15:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2013.codfw.wmnet [10:15:53] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [10:16:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:19:02] (03PS1) 10Majavah: P:toolforge::redis_sentinel: Add nftables-based VRRP rule [puppet] - 10https://gerrit.wikimedia.org/r/1297093 (https://phabricator.wikimedia.org/T427799) [10:19:15] (03PS2) 10Clément Goubert: trafficserver: Remove all gateway-check config [puppet] - 10https://gerrit.wikimedia.org/r/1293704 (https://phabricator.wikimedia.org/T422937) [10:19:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T426633)', diff saved to https://phabricator.wikimedia.org/P93677 and previous config saved to /var/cache/conftool/dbconfig/20260603-101916-fceratto.json [10:19:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2196.codfw.wmnet with reason: host reimage [10:20:06] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8634/console" [puppet] - 10https://gerrit.wikimedia.org/r/1297093 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [10:20:46] (03CR) 10Brouberol: [C:03+2] dse-k8s: deploy the ceph-csi pods 
on dedicated workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297089 (https://phabricator.wikimedia.org/T428036) (owner: 10Brouberol) [10:20:54] (03CR) 10Btullis: [C:03+2] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:21:09] (03PS7) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [10:21:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:21:28] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be1066 [10:21:45] (03CR) 10Cathal Mooney: [C:03+2] netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:22:17] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be1067 [10:22:24] jouncebot: nowandnext [10:22:24] For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1000) [10:22:24] In 0 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1100) [10:23:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297090 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [10:23:44] (03CR) 10Effie Mouzeli: [C:03+2] rest-gateway: replace rdb2009 with rdb2013 #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 
(https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:23:49] (03PS3) 10Effie Mouzeli: rest-gateway: replace rdb2009 with rdb2013 #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) [10:24:01] (03Merged) 10jenkins-bot: netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:24:26] (03Merged) 10jenkins-bot: hCaptcha: Enable for MobileFrontend on most group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297090 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [10:24:39] (03PS2) 10Effie Mouzeli: mediawiki-common: add rdb2013 and rdb2014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297087 (https://phabricator.wikimedia.org/T418924) [10:24:53] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297090|hCaptcha: Enable for MobileFrontend on most group1 wikis (T425940)]] [10:24:56] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [10:25:46] (03CR) 10Clément Goubert: [C:03+1] mediawiki-common: add rdb2013 and rdb2014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297087 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:25:59] (03CR) 10FNegri: [C:03+1] P:elasticsearch: Migrate inter-node traffic to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [10:26:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate 
[10:26:16] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting, 13Patch-For-Review: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11980758 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one. Alert is in place and fir... [10:26:37] (03CR) 10Majavah: [V:03+1 C:03+2] P:elasticsearch: Migrate inter-node traffic to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295915 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [10:26:49] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297090|hCaptcha: Enable for MobileFrontend on most group1 wikis (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:29:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P93679 and previous config saved to /var/cache/conftool/dbconfig/20260603-102924-fceratto.json [10:30:11] (03CR) 10Effie Mouzeli: [C:03+1] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:30:18] (03CR) 10Effie Mouzeli: [C:03+2] rest-gateway: replace rdb2009 with rdb2013 #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:30:36] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:30:45] 06SRE, 06Infrastructure-Foundations, 10netops: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#11980777 (10cmooney) 05Open→03Declined I'm going to close this one for now. Given we are moving the dns hosts to new vlans under T422043, during which t... 
[10:31:08] 06SRE, 06Infrastructure-Foundations, 10netops: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#11980784 (10cmooney) [10:31:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:31:58] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Remove all gateway-check config [puppet] - 10https://gerrit.wikimedia.org/r/1293704 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [10:32:40] (03Merged) 10jenkins-bot: rest-gateway: replace rdb2009 with rdb2013 #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:32:46] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [10:34:01] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Pre-teardown deprecation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) (owner: 10Clément Goubert) [10:34:19] (03PS1) 10Cathal Mooney: network data.yaml: correct invalid netmask for eqiad E/F sw loopbacks [puppet] - 10https://gerrit.wikimedia.org/r/1297095 [10:34:36] 10ops-codfw, 06DC-Ops: Move test host in codfw rack B3 - https://phabricator.wikimedia.org/T428041 (10ayounsi) 03NEW [10:34:59] 10ops-codfw, 06DC-Ops: Move test host in codfw rack B3 - https://phabricator.wikimedia.org/T428041#11980808 (10ayounsi) [10:35:12] effie: it's possible you get my changes to the api-gateway chart when you apply (depending how fast CI is), if so feel free to deploy it's noop for the rest-gateway [10:36:16] (03Merged) 10jenkins-bot: api-gateway: Pre-teardown deprecation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) (owner: 10Clément Goubert) [10:36:56] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297090|hCaptcha: Enable for MobileFrontend on most group1 wikis (T425940)]] (duration: 12m 03s) [10:37:00] 
T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [10:37:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2196.codfw.wmnet with OS trixie [10:37:55] (03PS1) 10Brouberol: Revert "dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label" [puppet] - 10https://gerrit.wikimedia.org/r/1297096 [10:38:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1053: repool after upgrade [10:39:02] (03CR) 10Ayounsi: [C:03+1] network data.yaml: correct invalid netmask for eqiad E/F sw loopbacks [puppet] - 10https://gerrit.wikimedia.org/r/1297095 (owner: 10Cathal Mooney) [10:39:07] claime: cheers [10:39:15] (03CR) 10Btullis: [C:03+1] Revert "dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label" [puppet] - 10https://gerrit.wikimedia.org/r/1297096 (owner: 10Brouberol) [10:39:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P93681 and previous config saved to /var/cache/conftool/dbconfig/20260603-103931-fceratto.json [10:39:33] (03CR) 10Brouberol: [C:03+2] Revert "dse-k8s-wdqs-test1001: set the node-role.kuberneteos.io node label" [puppet] - 10https://gerrit.wikimedia.org/r/1297096 (owner: 10Brouberol) [10:40:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[10:40:23] (03CR) 10Effie Mouzeli: [C:03+2] "similar to I3cae52f905c4c925c4d320e114e572050268f77e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294277 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:40:31] (03PS2) 10Effie Mouzeli: changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294277 (https://phabricator.wikimedia.org/T418924) [10:40:34] (03CR) 10CI reject: [V:04-1] changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294277 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:40:43] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:40:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:41:04] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:43:15] (03PS1) 10Effie Mouzeli: changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297098 (https://phabricator.wikimedia.org/T418924) [10:43:40] (03Abandoned) 10Effie Mouzeli: changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294277 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:44:07] (03PS2) 10Effie Mouzeli: changeprop-jobqueue: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294278 (https://phabricator.wikimedia.org/T418924) [10:44:16] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:44:36] (03CR) 10Cathal Mooney: [C:03+2] network data.yaml: correct invalid netmask for eqiad E/F sw loopbacks [puppet] - 10https://gerrit.wikimedia.org/r/1297095 (owner: 10Cathal Mooney) [10:44:37] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2196: Migration of db2196.codfw.wmnet completed [10:45:02] !log jiji@deploy1003 helmfile [eqiad] START 
helmfile.d/services/rest-gateway: apply [10:45:17] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:46:00] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: add rdb2013 and rdb2014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297087 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:48:34] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11980853 (10jijiki) [10:49:20] (03Merged) 10jenkins-bot: mediawiki-common: add rdb2013 and rdb2014 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297087 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:49:33] (03CR) 10Mszwarc: Update UserInfoCard to be enabled by default for certain user groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) (owner: 10Mszwarc) [10:49:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T426633)', diff saved to https://phabricator.wikimedia.org/P93683 and previous config saved to /var/cache/conftool/dbconfig/20260603-104939-fceratto.json [10:49:46] (03PS1) 10Clément Goubert: api-gateway: Remove hardcoded routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297099 (https://phabricator.wikimedia.org/T426881) [10:49:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [10:50:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T426633)', diff saved to https://phabricator.wikimedia.org/P93684 and previous config saved to /var/cache/conftool/dbconfig/20260603-105006-fceratto.json [10:50:46] (03CR) 10Effie Mouzeli: "similar to I3cae52f905c4c925c4d320e114e572050268f77e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297098 
(https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:50:54] (03CR) 10Effie Mouzeli: [C:03+2] changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297098 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:51:12] jouncebot: nowandnext [10:51:12] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1000) [10:51:13] In 0 hour(s) and 8 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1100) [10:51:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:51:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) (owner: 10Mszwarc) [10:51:57] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:52:22] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Remove hardcoded routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297099 (https://phabricator.wikimedia.org/T426881) (owner: 10Clément Goubert) [10:52:49] (03Merged) 10jenkins-bot: Update UserInfoCard to be enabled by default for certain user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) (owner: 10Mszwarc) [10:52:58] (03Merged) 10jenkins-bot: changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297098 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:53:14] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1289895|Update UserInfoCard to be enabled by default 
for certain user groups (T426021)]] [10:53:18] T426021: Change UIC default configuration - https://phabricator.wikimedia.org/T426021 [10:53:47] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:54:16] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:54:33] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:54:48] (03Merged) 10jenkins-bot: api-gateway: Remove hardcoded routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297099 (https://phabricator.wikimedia.org/T426881) (owner: 10Clément Goubert) [10:54:50] (03CR) 10Effie Mouzeli: [C:03+2] "similar to Ib274013abe2ffa9bff820abf78415b474f029ac8" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294278 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:55:11] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1289895|Update UserInfoCard to be enabled by default for certain user groups (T426021)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:55:30] (03PS1) 10Federico Ceratto: sre.mysql: add local ruff.toml [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 [10:55:51] (03PS2) 10SomeRandomDeveloper: Update hCaptcha checks to retrieve API parameters from $_REQUEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292364 (https://phabricator.wikimedia.org/T427105) [10:56:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:56:41] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [10:57:00] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:57:03] (03Merged) 10jenkins-bot: changeprop-jobqueue: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294278 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [10:57:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:58:02] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:58:07] (03PS17) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [10:58:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T426633)', diff saved to https://phabricator.wikimedia.org/P93685 and previous config saved to /var/cache/conftool/dbconfig/20260603-105815-fceratto.json [10:58:25] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:58:38] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:59:12] (03CR) 10CI reject: [V:04-1] sre.mysql: add local ruff.toml [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 
10Federico Ceratto) [10:59:15] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:59:39] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:59:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:59:45] (03PS2) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [10:59:51] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:00:05] mvolz: That opportune time for a Services – Citoid / Zotero deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1100). [11:00:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:00:52] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289895|Update UserInfoCard to be enabled by default for certain user groups (T426021)]] (duration: 07m 37s) [11:00:55] T426021: Change UIC default configuration - https://phabricator.wikimedia.org/T426021 [11:01:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:01:13] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:02:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:02:33] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:02:35] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply 
[11:02:46] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:02:58] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:03:17] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:03:23] !incidents [11:03:24] 8048 (RESOLVED) [6x] ProbeDown sre (probes/service) [11:03:24] 8047 (RESOLVED) PHPFPMTooBusy sre (mw-web main codfw) [11:03:24] 8046 (RESOLVED) PHPFPMTooBusy sre (mw-web main codfw) [11:03:24] 8045 (RESOLVED) [8x] ProbeDown sre (probes/service) [11:03:24] 8040 (RESOLVED) Host es2050 (paged) [11:03:25] 8039 (RESOLVED) Host db2175 (paged) [11:03:25] 8042 (RESOLVED) Host db2157 (paged) [11:03:25] 8043 (RESOLVED) Host db2153 (paged) [11:03:26] 8041 (RESOLVED) Host db2154 (paged) [11:03:26] 8044 (RESOLVED) Host db2176 (paged) [11:05:09] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:06:07] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:06:15] (03PS1) 10Clément Goubert: deployment-server: Symlink clusterinfo for cluster_alias [puppet] - 10https://gerrit.wikimedia.org/r/1297101 (https://phabricator.wikimedia.org/T388969) [11:06:48] (03CR) 10CI reject: [V:04-1] deployment-server: Symlink clusterinfo for cluster_alias [puppet] - 10https://gerrit.wikimedia.org/r/1297101 (https://phabricator.wikimedia.org/T388969) (owner: 10Clément Goubert) [11:07:05] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:07:08] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be1066 [11:08:20] (03PS2) 10Cathal Mooney: nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) [11:08:24] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P93687 and previous config saved to /var/cache/conftool/dbconfig/20260603-110823-fceratto.json [11:08:39] (03PS2) 10Clément Goubert: deployment-server: Symlink clusterinfo for cluster_alias [puppet] - 10https://gerrit.wikimedia.org/r/1297101 (https://phabricator.wikimedia.org/T388969) [11:08:40] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:09:03] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:09:31] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:09:33] (03PS3) 10Cathal Mooney: nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) [11:09:33] (03CR) 10CI reject: [V:04-1] nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [11:10:43] (03CR) 10CI reject: [V:04-1] nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [11:11:11] mvernon@cumin2002 convert-disks (PID 3835234) is awaiting input [11:11:54] (03PS1) 10Slyngshede: C:dumps::web::xmldumps block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) [11:14:50] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:15:06] (03CR) 10Slyngshede: "The email and URL check feels unnecessary, who's going to modify the python-requests header and simply add their email to the generic head" [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) 
[11:16:32] (03PS2) 10Federico Ceratto: sre.mysql: add local ruff.toml [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 [11:16:52] (03CR) 10Btullis: [C:03+2] logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [11:18:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P93689 and previous config saved to /var/cache/conftool/dbconfig/20260603-111831-fceratto.json [11:19:52] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:20:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:20:37] (03PS4) 10Cathal Mooney: nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) [11:20:41] (03CR) 10CI reject: [V:04-1] sre.mysql: add local ruff.toml [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 10Federico Ceratto) [11:20:52] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:20:58] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:21:35] (03CR) 10Clément Goubert: [C:03+2] ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [11:22:51] (03PS2) 10Cathal Mooney: nftables: remove 'notrack' directory from /etc/nftables [puppet] - 10https://gerrit.wikimedia.org/r/1259896 (https://phabricator.wikimedia.org/T420715) [11:23:45] (03Merged) 10jenkins-bot: ratelimit: Add CACHE_KEY_PREFIX 
configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) (owner: 10Clément Goubert) [11:24:53] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:25:51] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [11:26:05] (03PS1) 10JavierMonton: stream: staging page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297104 (https://phabricator.wikimedia.org/T425336) [11:26:13] (03PS2) 10Cathal Mooney: nftables: remove the file definition for /etc/nftables/notrack [puppet] - 10https://gerrit.wikimedia.org/r/1259898 (https://phabricator.wikimedia.org/T420715) [11:26:17] (03CR) 10Harroyo-wmf: [C:03+1] hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [11:27:08] (03PS2) 10JavierMonton: stream: staging page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297104 (https://phabricator.wikimedia.org/T425336) [11:28:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T426633)', diff saved to https://phabricator.wikimedia.org/P93690 and previous config saved to /var/cache/conftool/dbconfig/20260603-112838-fceratto.json [11:28:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [11:29:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [11:29:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T426633)', diff saved to https://phabricator.wikimedia.org/P93691 and previous config saved to /var/cache/conftool/dbconfig/20260603-112909-fceratto.json 
[11:30:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2196: Migration of db2196.codfw.wmnet completed [11:30:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:30:58] (03CR) 10JavierMonton: [C:03+2] stream: staging page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297104 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [11:32:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:32:30] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:33:01] (03Merged) 10jenkins-bot: stream: staging page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297104 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [11:33:19] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:33:30] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [11:35:06] (03CR) 10Btullis: wdqs-backend: Deployment chart for the WDQS triple-store (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [11:36:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T426633)', diff saved to https://phabricator.wikimedia.org/P93693 and previous config saved to /var/cache/conftool/dbconfig/20260603-113611-fceratto.json [11:39:32] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be1067 [11:40:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1066.eqiad.wmnet with OS bullseye [11:40:07] 10SRE-swift-storage, 10Ceph, 
06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11981009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1066.eq... [11:40:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1066 [11:41:01] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:41:03] jouncebot: now [11:41:03] For the next 0 hour(s) and 18 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1100) [11:41:23] dear folks, I will be failing over the docker registry to the standby one [11:42:03] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [11:42:13] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [11:42:21] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [11:42:40] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [11:42:48] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [11:43:12] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [11:45:15] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1066 - mvernon@cumin2002" [11:45:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1066 - mvernon@cumin2002" [11:45:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:45:23] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1066.eqiad.wmnet 117.32.64.10.in-addr.arpa 7.1.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all 
recursors [11:45:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1066.eqiad.wmnet 117.32.64.10.in-addr.arpa 7.1.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:45:28] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1066 [11:46:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1066 [11:46:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1066 [11:46:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P93695 and previous config saved to /var/cache/conftool/dbconfig/20260603-114618-fceratto.json [11:46:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1067.eqiad.wmnet with OS bullseye [11:46:52] (03CR) 10Btullis: wdqs-backend: Deployment chart for the WDQS triple-store (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [11:46:52] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11981026 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1067.eq... 
[11:46:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1067 [11:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:47:24] (03PS1) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 [11:47:34] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [11:48:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:48:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2186: Upgrading db2186.codfw.wmnet [11:48:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2186: Upgrading db2186.codfw.wmnet [11:49:48] (03PS2) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [11:50:29] (03CR) 10DCausse: [C:03+1] translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296631 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [11:52:00] (03PS1) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php to rdb2013:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297110 (https://phabricator.wikimedia.org/T418261) [11:52:07] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1067 - mvernon@cumin2002" [11:52:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Update records for host ms-be1067 - mvernon@cumin2002" [11:52:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:52:13] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1067.eqiad.wmnet 96.48.64.10.in-addr.arpa 6.9.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:52:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1067.eqiad.wmnet 96.48.64.10.in-addr.arpa 6.9.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:52:18] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1067 [11:53:21] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11981044 (10jijiki) [11:53:45] marostegui@cumin1003 major-upgrade (PID 726740) is awaiting input [11:54:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1067 [11:54:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1067 [11:54:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS trixie [11:54:42] (03CR) 10Clément Goubert: [C:03+1] ProductionServices.php: switch filebackend.php to rdb2013:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297110 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [11:56:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P93697 and previous config saved to /var/cache/conftool/dbconfig/20260603-115626-fceratto.json [11:57:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.97% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:58:10] excellent [11:58:14] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:58:18] jouncebot: now [11:58:18] For the next 0 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1100) [11:58:35] I am about to run a backport folks [12:00:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:02:49] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [12:03:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1066.eqiad.wmnet with reason: host reimage [12:05:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:06:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 
(T426633)', diff saved to https://phabricator.wikimedia.org/P93698 and previous config saved to /var/cache/conftool/dbconfig/20260603-120634-fceratto.json [12:06:40] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297111 [12:07:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:07:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:07:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T426633)', diff saved to https://phabricator.wikimedia.org/P93699 and previous config saved to /var/cache/conftool/dbconfig/20260603-120732-fceratto.json [12:07:47] PROBLEM - Host ms-be1066 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:41] (03PS3) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [12:10:01] (03PS4) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [12:11:09] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11981085 (10MatthewVernon) [12:11:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1067.eqiad.wmnet with reason: host reimage [12:12:49] RECOVERY - Host ms-be1066 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.32% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:13:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [12:13:52] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1066.eqiad.wmnet with reason: host reimage [12:14:40] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11981119 (10TheDJ) @PantheraLeo1359531 It sounds like you have a very specific and complex workflow. You are probably more aware of what is possib... [12:15:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T426633)', diff saved to https://phabricator.wikimedia.org/P93700 and previous config saved to /var/cache/conftool/dbconfig/20260603-121533-fceratto.json [12:18:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [12:19:42] (03PS1) 10FNegri: toolsdb: reduce binlog retention [puppet] - 10https://gerrit.wikimedia.org/r/1297114 (https://phabricator.wikimedia.org/T427187) [12:20:46] (03CR) 10Dreamy Jazz: [C:03+1] Update hCaptcha checks to retrieve API parameters from $_REQUEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292364 (https://phabricator.wikimedia.org/T427105) (owner: 10SomeRandomDeveloper) [12:21:30] jouncebot: nowandnext [12:21:30] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [12:21:30] In 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1300) [12:21:35] (03CR) 10Ilias Sarantopoulos: [C:04-1] "Thanks for creating this Ozge! 
However this should be split in 2 patch since it involves different systems (LIftWIng and Rest gateway) and" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [12:21:35] Going to use scap [12:22:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1067.eqiad.wmnet with reason: host reimage [12:24:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292364 (https://phabricator.wikimedia.org/T427105) (owner: 10SomeRandomDeveloper) [12:24:50] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11981136 (10TheDJ) >>! In T427949#11977910, @MatthewVernon wrote: > TIFF compression can be done losslessly, so I see no reason to accept uncompre... [12:25:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P93701 and previous config saved to /var/cache/conftool/dbconfig/20260603-122541-fceratto.json [12:26:34] (03Merged) 10jenkins-bot: Update hCaptcha checks to retrieve API parameters from $_REQUEST [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292364 (https://phabricator.wikimedia.org/T427105) (owner: 10SomeRandomDeveloper) [12:27:01] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1292364|Update hCaptcha checks to retrieve API parameters from $_REQUEST (T427105)]] [12:27:08] T427105: Update hCaptcha code in mediawiki-config to no longer depend on action=visualeditoredit/discussiontoolsedit - https://phabricator.wikimedia.org/T427105 [12:28:06] PROBLEM - Host ms-be1066 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:59] !log dreamyjazz@deploy1003 somerandomdeveloper, dreamyjazz: Backport for [[gerrit:1292364|Update hCaptcha checks to 
retrieve API parameters from $_REQUEST (T427105)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:30:22] RECOVERY - Host ms-be1066 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [12:31:39] Dreamy_Jazz: are you good to go? [12:31:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1066.eqiad.wmnet with OS bullseye [12:31:51] I'm still testing [12:31:54] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11981172 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1066.eqiad.... [12:32:02] But should be done shortly [12:33:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:34:06] !log dreamyjazz@deploy1003 somerandomdeveloper, dreamyjazz: Continuing with deployment [12:34:35] 06SRE, 13Patch-For-Review: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11981175 (10elukey) ` elukey@kafka-jumbo1010:~$ sudo -E kafka acls --remove --deny-principal User:ANONYMOUS --operation Write --topic webrequest_text Root user detected, using the broker's super user auth c... 
[12:35:06] (03PS8) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [12:35:13] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [12:35:48] effie: are you deploying something after Dreamy_Jazz ? [12:35:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P93702 and previous config saved to /var/cache/conftool/dbconfig/20260603-123548-fceratto.json [12:35:57] kostajh: yes [12:36:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [12:36:11] ack [12:36:15] cheers thnks [12:36:17] I’ll wait for the UTC afternoon window then [12:36:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2186.codfw.wmnet with OS trixie [12:36:52] (03PS3) 10Dreamy Jazz: hCaptcha: Don't show AbuseFilter CAPTCHA for wbsetclaim API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) [12:36:57] (03PS5) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [12:37:36] (03PS6) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [12:37:51] (03CR) 10CI reject: [V:04-1] hCaptcha: Don't show AbuseFilter CAPTCHA for wbsetclaim API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 
(https://phabricator.wikimedia.org/T427608) (owner: 10Dreamy Jazz) [12:38:09] (03CR) 10Ozge: feat: adds editing suggestions to ml experimental (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [12:38:17] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1292364|Update hCaptcha checks to retrieve API parameters from $_REQUEST (T427105)]] (duration: 11m 15s) [12:38:23] T427105: Update hCaptcha code in mediawiki-config to no longer depend on action=visualeditoredit/discussiontoolsedit - https://phabricator.wikimedia.org/T427105 [12:38:43] effie: I'm done [12:38:47] cheers [12:40:33] (03PS4) 10Dreamy Jazz: hCaptcha: Don't show AbuseFilter CAPTCHA for wbsetclaim API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) [12:40:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jiji@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297110 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [12:41:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1067.eqiad.wmnet with OS bullseye [12:41:41] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11981212 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1067.eqiad.... [12:42:09] (03CR) 10AikoChou: "Thanks for working on this! I left a few comments." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [12:42:48] (03Merged) 10jenkins-bot: ProductionServices.php: switch filebackend.php to rdb2013:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297110 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [12:43:06] (03PS1) 10Tiziano Fogli: puppetmaster: remove obsolete alerts [alerts] - 10https://gerrit.wikimedia.org/r/1297117 (https://phabricator.wikimedia.org/T426809) [12:43:13] !log jiji@deploy1003 Started scap sync-world: Backport for [[gerrit:1297110|ProductionServices.php: switch filebackend.php to rdb2013:6381 (T418261 T419976)]] [12:43:17] T418261: rdb20[11-12] implementation tracking - https://phabricator.wikimedia.org/T418261 [12:43:18] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [12:43:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2186: Migration of db2186.codfw.wmnet completed [12:45:14] !log jiji@deploy1003 jiji: Backport for [[gerrit:1297110|ProductionServices.php: switch filebackend.php to rdb2013:6381 (T418261 T419976)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:45:51] (03PS1) 10MVernon: swift: restore two reimaged nodes to the eqiad rings [puppet] - 10https://gerrit.wikimedia.org/r/1297120 (https://phabricator.wikimedia.org/T421719) [12:45:53] (03PS1) 10Jgreen: Switch frack default bastion to codfw to prep for eqiad kernel upgrade. 
[dns] - 10https://gerrit.wikimedia.org/r/1297122 [12:45:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T426633)', diff saved to https://phabricator.wikimedia.org/P93704 and previous config saved to /var/cache/conftool/dbconfig/20260603-124556-fceratto.json [12:46:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:46:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T426633)', diff saved to https://phabricator.wikimedia.org/P93705 and previous config saved to /var/cache/conftool/dbconfig/20260603-124624-fceratto.json [12:46:46] !log jiji@deploy1003 jiji: Continuing with deployment [12:46:51] (03CR) 10Brouberol: [C:03+1] "Thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [12:47:40] (03CR) 10Jgreen: [C:03+2] Switch frack default bastion to codfw to prep for eqiad kernel upgrade. [dns] - 10https://gerrit.wikimedia.org/r/1297122 (owner: 10Jgreen) [12:47:56] !log jgreen@dns1004 START - running authdns-update [12:48:09] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11981231 (10jijiki) >>! In T418918#11963447, @MLechvien-WMF wrote: > @jijiki are you handling that task too as part of {https://phabricator.wikimedia.org/T419... 
[12:49:27] !log jgreen@dns1004 END - running authdns-update [12:50:00] (03PS1) 10Effie Mouzeli: alias.yaml: retire the old codfw redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1297124 (https://phabricator.wikimedia.org/T419976) [12:50:57] !log jiji@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297110|ProductionServices.php: switch filebackend.php to rdb2013:6381 (T418261 T419976)]] (duration: 07m 44s) [12:51:03] T418261: rdb20[11-12] implementation tracking - https://phabricator.wikimedia.org/T418261 [12:51:04] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [12:51:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:52:03] !ack [12:52:04] 8053 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [12:52:18] (03CR) 10Marostegui: [C:03+1] swift: restore two reimaged nodes to the eqiad rings [puppet] - 10https://gerrit.wikimedia.org/r/1297120 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [12:52:44] (03CR) 10Elukey: [C:03+2] profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [12:54:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 
(T426633)', diff saved to https://phabricator.wikimedia.org/P93706 and previous config saved to /var/cache/conftool/dbconfig/20260603-125540-fceratto.json [12:56:25] o/ [12:56:34] (03CR) 10MVernon: [C:03+2] swift: restore two reimaged nodes to the eqiad rings [puppet] - 10https://gerrit.wikimedia.org/r/1297120 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [12:56:42] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: haproxy in Beta cluster has invalid config - https://phabricator.wikimedia.org/T428052#11981250 (10Urbanecm_WMF) [12:56:59] jhathaway: see -sre-private [12:57:06] thanks [12:57:37] (03PS1) 10Tiziano Fogli: liberica: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1297125 (https://phabricator.wikimedia.org/T426809) [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1300). [13:00:05] atsukoito, dbrant, jakob_WMDE, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:11] o/ [13:01:17] o/ [13:01:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:02:29] hello [13:02:45] effie: are you finished with deploying? [13:03:46] kostajh: yes, sorry! 
[13:04:42] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp2043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:04:42] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp2043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:05:42] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:05:47] ^ depooled, and is me [13:05:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P93708 and previous config saved to /var/cache/conftool/dbconfig/20260603-130548-fceratto.json [13:07:13] I can deploy [13:07:46] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:07:46] PROBLEM - haproxy process on cp2043 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:08:00] kostajh: should we do both of ours at once? [13:08:12] I’ll ship them separately [13:08:17] atsukoito: are you around? 
[13:08:24] (03PS1) 10CWilliams: Provide downtime duration information in sre.mysql cookbooks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) [13:08:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296649 (https://phabricator.wikimedia.org/T426048) (owner: 10Dbrant) [13:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:34] kostajh: if atsukoito does not show I can ship the patch at the end of the window [13:09:42] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp2043 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-08-05 06:34:20 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:09:42] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp2043 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-07-12 03:51:38 +0000 (expires in 38 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:09:46] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2043 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-07-06 20:52:29 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:09:46] RECOVERY - haproxy process on cp2043 is OK: PROCS OK: 1 process with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:09:50] (03Merged) 10jenkins-bot: hCaptcha: Roll out to all except enwiki for mobile apps. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296649 (https://phabricator.wikimedia.org/T426048) (owner: 10Dbrant) [13:09:52] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add codfw d3 and e5 public vlans - ayounsi@cumin1003" [13:09:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add codfw d3 and e5 public vlans - ayounsi@cumin1003" [13:09:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:10:06] dcausse: thanks, as I’m a bit unclear about that one, and would want someone who understands it better to deploy it and verify it [13:10:14] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1296649|hCaptcha: Roll out to all except enwiki for mobile apps. (T426048)]] [13:10:18] T426048: Roll out hCaptcha for use on app clients for Group 2 except enwiki - All Wikipedia' except English Wikipedia - https://phabricator.wikimedia.org/T426048 [13:11:11] (03CR) 10CWilliams: "Initial patch for the downtime period. I was wondering if instead of changing the existing log messages, adding a second message that show" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [13:11:58] (03PS2) 10CWilliams: Provide downtime duration information in sre.mysql cookbooks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) [13:12:13] !log kharlan@deploy1003 dbrant, kharlan: Backport for [[gerrit:1296649|hCaptcha: Roll out to all except enwiki for mobile apps. (T426048)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:12:23] dbrant: do you need a validation step on WikimediaDebug, or shall I sync this? 
[13:13:08] kostajh: just checked; should be good to go [13:13:47] ok [13:13:49] !log kharlan@deploy1003 dbrant, kharlan: Continuing with deployment [13:14:22] I’ll do the other hCaptcha patch next, then will sync the Wikibase one [13:14:38] (03PS5) 10Kosta Harlan: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) [13:15:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11981277 (10Jclark-ctr) [13:15:35] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:15:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20260603-131556-fceratto.json [13:16:04] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:17:01] (03PS2) 10Blake: kubernetes-1.31: Update systemd overrides and changelog. [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) [13:17:14] 10ops-codfw, 06SRE, 06DC-Ops: Move test host in codfw rack B3 or D3 - https://phabricator.wikimedia.org/T428041#11981279 (10ayounsi) [13:18:00] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296649|hCaptcha: Roll out to all except enwiki for mobile apps. 
(T426048)]] (duration: 07m 46s) [13:18:05] T426048: Roll out hCaptcha for use on app clients for Group 2 except enwiki - All Wikipedia' except English Wikipedia - https://phabricator.wikimedia.org/T426048 [13:18:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [13:18:28] PROBLEM - Host mc2055 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:38] (03PS1) 10Sergio Gimeno: editor: make redesigned anon warning the default experience [extensions/MobileFrontend] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297129 (https://phabricator.wikimedia.org/T424595) [13:18:55] (03PS1) 10Sergio Gimeno: editor: make redesigned anon warning the default experience [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297130 (https://phabricator.wikimedia.org/T424595) [13:19:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 34079848 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:19:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297129 (https://phabricator.wikimedia.org/T424595) (owner: 10Sergio Gimeno) [13:20:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297130 (https://phabricator.wikimedia.org/T424595) (owner: 10Sergio Gimeno) [13:20:08] (03Merged) 10jenkins-bot: hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [13:20:20] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 11624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:20:33] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1295910|hCaptcha: Roll out self-hosted secure-api.js to all wikis (T403829)]] [13:20:47] hi, I just added a couple of changes to the window queue. I'm ok self-deploying at the end [13:21:14] kostajh: can you ping me when you're done? [13:22:35] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1295910|hCaptcha: Roll out self-hosted secure-api.js to all wikis (T403829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:22:37] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [13:23:57] !log kharlan@deploy1003 kharlan: Continuing with deployment [13:25:17] sergi0: yes, will let you know. [13:25:31] ty! 
[13:25:42] !log sudo cumin 'A:lvs or A:liberica' 'disable-puppet "merging CR 1282764"' [13:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T426633)', diff saved to https://phabricator.wikimedia.org/P93710 and previous config saved to /var/cache/conftool/dbconfig/20260603-132605-fceratto.json [13:26:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:26:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T426633)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20260603-132638-fceratto.json [13:28:10] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295910|hCaptcha: Roll out self-hosted secure-api.js to all wikis (T403829)]] (duration: 07m 36s) [13:28:10] jakob_WMDE: I’ll sync your patch next [13:28:16] thanks! 
[13:29:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297058 (https://phabricator.wikimedia.org/T427935) (owner: 10Jakob) [13:29:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2186: Migration of db2186.codfw.wmnet completed [13:29:19] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [13:30:52] (03PS1) 10CDanis: cache: haproxy: add enable_mlock for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1297131 [13:31:05] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297131 (owner: 10CDanis) [13:32:28] (03Abandoned) 10Sergio Gimeno: editor: make redesigned anon warning the default experience [extensions/MobileFrontend] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297129 (https://phabricator.wikimedia.org/T424595) (owner: 10Sergio Gimeno) [13:33:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T426633)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20260603-133440-fceratto.json [13:35:04] (03CR) 10Elukey: [C:03+1] puppetmaster: remove obsolete alerts [alerts] - 10https://gerrit.wikimedia.org/r/1297117 (https://phabricator.wikimedia.org/T426809) (owner: 10Tiziano Fogli) [13:35:05] (03CR) 10Ssingh: [V:03+1 C:03+2] LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [13:35:09] (03CR) 10Ssingh: [C:03+2] ulsfo LVS: peer with the ToR switch [puppet] - 
10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:35:39] (03PS2) 10CDanis: cache: haproxy: add enable_mlock for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1297131 [13:35:43] (03PS2) 10Eevans: linked-artifacts: update for production deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296683 (https://phabricator.wikimedia.org/T414140) [13:35:45] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297131 (owner: 10CDanis) [13:35:51] (03PS7) 10Ssingh: LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [13:36:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:36:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:36:41] (03CR) 10Elukey: "I don't have a preference, this patch LGTM, maybe Riccardo has some opinions?" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [13:36:49] jouncebot: next [13:36:49] In 0 hour(s) and 23 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1400) [13:38:01] (03CR) 10Ssingh: [C:03+2] LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [13:38:02] (03CR) 10Ssingh: [V:03+2 C:03+2] LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [13:38:10] (03CR) 10Ssingh: [C:03+2] LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [13:38:36] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:38:57] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:39:05] (03CR) 10Ssingh: [C:03+1] "PCC looks good, systemd option looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1297131 (owner: 10CDanis) [13:39:21] (03PS3) 10CDanis: cache: haproxy: enable_mlock for systemd unit & 🚀eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1297131 [13:39:32] (03CR) 10CDanis: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1297131/6947/" [puppet] - 10https://gerrit.wikimedia.org/r/1297131 (owner: 10CDanis) [13:39:35] (03CR) 10Ssingh: cache: haproxy: enable_mlock for systemd unit & 🚀eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1297131 (owner: 10CDanis) [13:42:09] (03Merged) 10jenkins-bot: Search: Disable redundant search limit validation [extensions/Wikibase] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297058 (https://phabricator.wikimedia.org/T427935) (owner: 10Jakob) [13:42:36] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1297058|Search: Disable redundant search limit validation (T427935 T427695)]] [13:43:26] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:44:45] !log kharlan@deploy1003 jakob, kharlan: Backport for [[gerrit:1297058|Search: Disable redundant search limit validation (T427935 T427695)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:44:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260603-134448-fceratto.json [13:45:08] kostajh: tested, works! 
[13:45:23] jakob_WMDE: ok syncing [13:45:27] !log kharlan@deploy1003 jakob, kharlan: Continuing with deployment [13:46:49] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:03] !ack [13:47:03] 8054 (ACKED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad) [13:47:48] federico3: see _security [13:48:20] yes, looking [13:48:22] (03CR) 10Bartosz Wójtowicz: [C:03+1] linked-artifacts: update for production deploy (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296683 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [13:49:35] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:49:42] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297058|Search: Disable redundant search limit validation (T427935 T427695)]] (duration: 07m 05s) [13:49:55] dcausse: over to you [13:50:00] kostajh: thanks [13:50:14] atsukoito: going to ship the config change [13:50:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296631 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:51:28] (03PS7) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [13:51:49] RESOLVED: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:52] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:51:52] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:51:52] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:51:52] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:52:05] (03Merged) 10jenkins-bot: translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296631 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:52:07] !incidents [13:52:08] 8054 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad) [13:52:08] 8053 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [13:52:08] 8048 (RESOLVED) [6x] ProbeDown sre (probes/service) [13:52:08] 8047 (RESOLVED) PHPFPMTooBusy sre (mw-web main codfw) [13:52:08] 8046 (RESOLVED) PHPFPMTooBusy sre (mw-web main codfw) [13:52:09] 8045 (RESOLVED) [8x] ProbeDown sre (probes/service) [13:52:14] (03CR) 10Ozge: "thank you @achou@wikimedia.org and @isarantopoulos@wikimedia.org I've addressed your comments." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [13:52:33] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1296631|translate: adding separate read/write endpoints (T425377)]] [13:53:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:29] (03PS1) 10Matthias Mullie: Revert "MultimediaViewer: enable image carousel as a beta feature on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297137 [13:53:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297137 (owner: 10Matthias Mullie) [13:53:49] (03PS8) 10Ozge: feat: adds editing suggestions to ml experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [13:53:58] * sergi0 will lurk until issue is fixed [13:54:29] !log dcausse@deploy1003 atsuko, dcausse: Backport for [[gerrit:1296631|translate: adding separate read/write endpoints (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:55:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P93713 and previous config saved to /var/cache/conftool/dbconfig/20260603-135500-fceratto.json [13:55:46] (03PS1) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297136 [13:56:07] !log dcausse@deploy1003 atsuko, dcausse: Rolling back deployment [13:56:35] 06SRE, 06Infrastructure-Foundations: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024#11981407 (10elukey) Our packages have already their Trixie variant, so we are good. Among the other changes, I'd check: * python3-kubernetes - we had some issues in the past when migrating, so it is wor... [13:57:08] (03PS1) 10DCausse: Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297140 [13:57:40] 10ops-codfw, 06DC-Ops: codfw: move public baremetal servers to per rack vlan - https://phabricator.wikimedia.org/T428060 (10ayounsi) 03NEW [13:57:46] (03CR) 10DCausse: [C:03+2] Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297140 (owner: 10DCausse) [13:58:23] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:58:32] 06SRE, 06Infrastructure-Foundations: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024#11981425 (10elukey) [13:58:48] (03Merged) 10jenkins-bot: Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297140 (owner: 10DCausse) [13:58:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:58:50] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:59:18] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 
2026-05-19-171108 to 2026-06-03-023342 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297141 (https://phabricator.wikimedia.org/T411110) [13:59:26] 10ops-codfw, 06DC-Ops: codfw: move public baremetal servers to per rack vlan - https://phabricator.wikimedia.org/T428060#11981431 (10ayounsi) [13:59:32] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:59:34] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:59:44] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-19-223625 to 2026-06-03-020126 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297142 (https://phabricator.wikimedia.org/T411110) [13:59:46] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:59:56] I'm done deploying, sergi0 unsure you'll have time to ship anything in this window :( [14:00:00] 10ops-codfw, 06SRE, 06DC-Ops: Move test host in codfw rack B3 or D3 - https://phabricator.wikimedia.org/T428041#11981443 (10ayounsi) [14:00:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1400) [14:00:18] (03PS1) 10Jforrester: wikifunctions: Stop setting createCustomSpans on evaluators, ignored [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297143 [14:00:21] 06SRE, 06Infrastructure-Foundations: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024#11981449 (10elukey) [14:00:52] 10ops-codfw, 06SRE, 06DC-Ops: Move test host in codfw rack B3 or D3 - https://phabricator.wikimedia.org/T428041#11981459 (10ayounsi) [14:01:06] jouncebot: next [14:01:06] In 0 hour(s) and 28 minute(s): 
Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1430) [14:01:24] (03PS1) 10Brouberol: kubernetes/dse-k8s-eqiad: Define a kafka-ui kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1297144 (https://phabricator.wikimedia.org/T428053) [14:01:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:01:40] thank you dcausse, not sure if it is problematic for me to go ahead, objections? [14:01:58] jouncebot: now [14:01:58] For the next 0 hour(s) and 58 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1400) [14:02:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11981479 (10VRiley-WMF) @MatthewVernon is it okay to proceed? If so, which one should I start with? [14:02:34] sergi0: there's a window running at the moment so not sure [14:03:04] weird there seems to be 2 windows overlapping? 
[14:03:09] jouncebot: nowandnext [14:03:09] For the next 0 hour(s) and 56 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1400) [14:03:09] In 0 hour(s) and 26 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1430) [14:03:33] indeed, I can defer my change, no probs [14:04:55] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 2026-05-19-171108 to 2026-06-03-023342 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297141 (https://phabricator.wikimedia.org/T411110) (owner: 10Jforrester) [14:05:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T426633)', diff saved to https://phabricator.wikimedia.org/P93714 and previous config saved to /var/cache/conftool/dbconfig/20260603-140507-fceratto.json [14:05:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [14:05:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T426633)', diff saved to https://phabricator.wikimedia.org/P93715 and previous config saved to /var/cache/conftool/dbconfig/20260603-140537-fceratto.json [14:05:39] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296631|translate: adding separate read/write endpoints (T425377)]] (duration: 13m 06s) [14:05:44] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [14:06:33] (03PS1) 10Brouberol: Define the kafka-ui internal DNS records [dns] - 10https://gerrit.wikimedia.org/r/1297146 (https://phabricator.wikimedia.org/T428053) [14:06:57] (03CR) 10Effie Mouzeli: [C:03+1] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297136 (owner: 10Scott French) [14:06:59] 
10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11981503 (10Jclark-ctr) [14:07:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11981504 (10Jclark-ctr) 05In progress→03Resolved [14:07:39] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-05-19-171108 to 2026-06-03-023342 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297141 (https://phabricator.wikimedia.org/T411110) (owner: 10Jforrester) [14:07:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:08:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:08:11] I don't see any backports in course so I'm assuming it's ok [14:08:27] (03CR) 10Scott French: [C:03+1] confd: Replace deprecated fact [puppet] - 10https://gerrit.wikimedia.org/r/1296536 (owner: 10Majavah) [14:09:33] (03PS1) 10Brouberol: Define the kafka.w.o public record [dns] - 10https://gerrit.wikimedia.org/r/1297148 (https://phabricator.wikimedia.org/T428053) [14:09:34] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:09:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297130 (https://phabricator.wikimedia.org/T424595) (owner: 10Sergio Gimeno) [14:10:36] (03PS1) 10Brouberol: dse-k8s-eqiad: define the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297150 (https://phabricator.wikimedia.org/T428053) [14:10:37] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc2055.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART 
and with Dell SCP reboot policy GRACEFUL [14:10:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc2055.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:11:03] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:32] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:11:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:11:52] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:11:52] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:11:52] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:12:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T426633)', diff saved to https://phabricator.wikimedia.org/P93716 and previous config saved to /var/cache/conftool/dbconfig/20260603-141242-fceratto.json [14:12:55] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:03] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:16:04] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:16:27] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:16:30] !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:17:07] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-05-19-223625 to 2026-06-03-020126 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297142 
(https://phabricator.wikimedia.org/T411110) (owner: 10Jforrester) [14:19:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:19:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:19:24] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-05-19-223625 to 2026-06-03-020126 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297142 (https://phabricator.wikimedia.org/T411110) (owner: 10Jforrester) [14:20:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:20:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:20:12] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:20:36] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:20:41] (03Merged) 10jenkins-bot: editor: make redesigned anon warning the default experience [extensions/MobileFrontend] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297130 (https://phabricator.wikimedia.org/T424595) (owner: 10Sergio Gimeno) [14:21:40] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:21:56] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:21:59] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:22:22] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:22:30] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:22:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P93717 and previous config saved to /var/cache/conftool/dbconfig/20260603-142251-fceratto.json 
[14:23:04] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:23:34] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1297130|editor: make redesigned anon warning the default experience (T424595)]] [14:23:38] T424595: Scale updated logged-out edit warning on mobile to all wikis - https://phabricator.wikimedia.org/T424595 [14:24:51] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:24:53] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:25:32] !log sgimeno@deploy1003 sgimeno: Backport for [[gerrit:1297130|editor: make redesigned anon warning the default experience (T424595)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:26:06] * sergi0 tests [14:28:08] !log sgimeno@deploy1003 sgimeno: Continuing with deployment [14:29:16] (03CR) 10Btullis: [C:03+1] Define the kafka.w.o public record [dns] - 10https://gerrit.wikimedia.org/r/1297148 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [14:29:18] (03CR) 10Scott French: "Thanks for proposing this! Interesting approach - it didn't occur to me that we could delegate this to systemd this way." 
[puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [14:29:59] (03CR) 10Btullis: [C:03+1] kubernetes/dse-k8s-eqiad: Define a kafka-ui kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1297144 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1430) [14:30:15] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host thanos-be1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:30:35] (03PS1) 10Jforrester: wikifunctions: Correct configs for Rust versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297155 [14:30:39] Anyone from Wikifunctions around? [14:30:44] (03CR) 10Jforrester: [C:03+2] wikifunctions: Stop setting createCustomSpans on evaluators, ignored [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297143 (owner: 10Jforrester) [14:31:09] I also have an urgent config patch that I would like to scap ASAP - would that be possible? Already got OK from TestKitchen (who share next half hr slot with Wikifunctions) [14:31:24] cc James_F? [14:31:45] matthiasmullie: Go for it, our window is services-only. [14:32:24] PROBLEM - Host thanos-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:41] James_F: thanks [14:32:48] sergi0: can you ping me when you're done? 
[14:32:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P93719 and previous config saved to /var/cache/conftool/dbconfig/20260603-143259-fceratto.json [14:33:09] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Correct configs for Rust versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297155 (owner: 10Jforrester) [14:33:30] (03Merged) 10jenkins-bot: wikifunctions: Stop setting createCustomSpans on evaluators, ignored [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297143 (owner: 10Jforrester) [14:33:31] matthiasmullie: sure, we're at ~50% [14:34:20] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297130|editor: make redesigned anon warning the default experience (T424595)]] (duration: 10m 45s) [14:34:23] T424595: Scale updated logged-out edit warning on mobile to all wikis - https://phabricator.wikimedia.org/T424595 [14:34:38] matthiasmullie: all yours [14:34:48] thanks! 
[14:35:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2031 [14:35:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297137 (owner: 10Matthias Mullie) [14:35:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wdqs2031 [14:35:40] (03Merged) 10jenkins-bot: wikifunctions: Correct configs for Rust versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297155 (owner: 10Jforrester) [14:36:52] RECOVERY - Host thanos-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:38:34] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:39:17] vriley@cumin1003 provision (PID 843054) is awaiting input [14:39:24] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:39:33] (03PS1) 10Urbanecm: [Growth] wikidatawiki: Enable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) [14:39:41] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:39:43] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host mc2055 [14:39:45] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host mc2055 [14:40:11] (03CR) 10Urbanecm: [C:04-2] "scheduled for the week of 15th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: 10Urbanecm) [14:40:58] (03Merged) 10jenkins-bot: Revert "MultimediaViewer: enable image carousel as a beta feature on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297137 (owner: 10Matthias Mullie) [14:41:21] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1297137|Revert "MultimediaViewer: 
enable image carousel as a beta feature on Wikipedias"]] [14:41:33] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:41:39] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:42:10] (03PS1) 10Matthias Mullie: MultimediaViewer: enable image carousel as a beta feature on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297162 [14:42:16] (03CR) 10Eevans: [C:03+2] linked-artifacts: update for production deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296683 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [14:42:38] PROBLEM - Host thanos-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:40] (03CR) 10Matthias Mullie: [C:04-1] "Awaiting confirmation; DNM for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297162 (owner: 10Matthias Mullie) [14:43:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T426633)', diff saved to https://phabricator.wikimedia.org/P93721 and previous config saved to /var/cache/conftool/dbconfig/20260603-144306-fceratto.json [14:43:21] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1297137|Revert "MultimediaViewer: enable image carousel as a beta feature on Wikipedias"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:43:22] (03CR) 10Urbanecm: growthexperiments.pp: Run cleanMentorList every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:43:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:43:30] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:43:31] (03CR) 10Urbanecm: "This is now ready to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:43:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T426633)', diff saved to https://phabricator.wikimedia.org/P93722 and previous config saved to /var/cache/conftool/dbconfig/20260603-144334-fceratto.json [14:43:42] (03PS1) 10Ayounsi: Overide CertAlmostExpired for network devices [alerts] - 10https://gerrit.wikimedia.org/r/1297163 [14:43:57] !log mlitn@deploy1003 mlitn: Continuing with deployment [14:44:04] (03CR) 10Majavah: [C:03+2] confd: Replace deprecated fact [puppet] - 10https://gerrit.wikimedia.org/r/1296536 (owner: 10Majavah) [14:45:05] (03Merged) 10jenkins-bot: linked-artifacts: update for production deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296683 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [14:45:06] (03PS1) 10CDanis: cache: haproxy: enable_mlock 🚀esams [puppet] - 10https://gerrit.wikimedia.org/r/1297164 [14:46:04] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:46:16] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:46:19] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297164 (owner: 10CDanis) [14:46:33] (03PS1) 10Slyngshede: data.yaml Extend sarmbruster [puppet] - 10https://gerrit.wikimedia.org/r/1297165 [14:47:38] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/linked-artifacts: apply [14:47:50] (03CR) 10Slyngshede: [C:03+2] data.yaml Extend sarmbruster [puppet] - 10https://gerrit.wikimedia.org/r/1297165 (owner: 10Slyngshede) [14:48:07] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297137|Revert "MultimediaViewer: enable image carousel as a beta feature on Wikipedias"]] (duration: 06m 46s) [14:48:26] Done deploying - thanks for letting me move forward, y'all! 
[14:48:30] (03PS18) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [14:48:36] (03CR) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [14:49:02] (03PS1) 10Gkyziridis: ml-services: add liftwing-openapi-server deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) [14:49:09] (03CR) 10Volans: confd: Add condition to prevent starting without configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [14:49:58] (03PS3) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [14:50:00] RECOVERY - Host thanos-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:50:25] (03PS1) 10Gkyziridis: liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs [puppet] - 10https://gerrit.wikimedia.org/r/1297168 (https://phabricator.wikimedia.org/T427902) [14:50:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T426633)', diff saved to https://phabricator.wikimedia.org/P93723 and previous config saved to /var/cache/conftool/dbconfig/20260603-145039-fceratto.json [14:50:47] (03CR) 10CI reject: [V:04-1] ml-services: add liftwing-openapi-server deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [14:51:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:51:54] 
RECOVERY - Host mc2055 is UP: PING OK - Packet loss = 0%, RTA = 31.49 ms [14:52:00] (03CR) 10Andrew Bogott: [C:03+1] toolsdb: reduce binlog retention [puppet] - 10https://gerrit.wikimedia.org/r/1297114 (https://phabricator.wikimedia.org/T427187) (owner: 10FNegri) [14:52:57] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox [14:53:01] (03CR) 10FNegri: [C:03+2] toolsdb: reduce binlog retention [puppet] - 10https://gerrit.wikimedia.org/r/1297114 (https://phabricator.wikimedia.org/T427187) (owner: 10FNegri) [14:55:46] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:56:48] (03PS2) 10Gkyziridis: ml-services: add liftwing-openapi-server deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) [14:57:45] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linked-artifacts: apply [14:58:19] (03PS2) 10Brouberol: Define the kafka-ui internal DNS records [dns] - 10https://gerrit.wikimedia.org/r/1297146 (https://phabricator.wikimedia.org/T428053) [14:58:19] (03PS2) 10Brouberol: Define the kafka.w.o public record [dns] - 10https://gerrit.wikimedia.org/r/1297148 (https://phabricator.wikimedia.org/T428053) [14:59:09] (03PS19) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [14:59:29] (03PS2) 10Brouberol: kubernetes/dse-k8s: Define a kafka-ui kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1297144 (https://phabricator.wikimedia.org/T428053) [14:59:59] (03CR) 10Tiziano Fogli: [C:03+1] "LGTM. The only thing I noticed, if you're interested in adding it, is that the test case for the critical alert scenario is missing." 
[alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [15:00:05] mutante and hashar: Deploy window Jenkins switchover/upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1500) [15:00:27] (03PS4) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [15:00:42] (03CR) 10Michael Große: "This should probably also include testwikidatawiki (https://test.wikidata.org/wiki/Wikidata:Main_Page). I think there is also a dblist `wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: 10Urbanecm) [15:00:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P93725 and previous config saved to /var/cache/conftool/dbconfig/20260603-150047-fceratto.json [15:00:50] (03PS2) 10Brouberol: dse-k8s-eqiad: define the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297150 (https://phabricator.wikimedia.org/T428053) [15:01:13] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host thanos-be1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:01:22] (03CR) 10Ssingh: [C:03+1] cache: haproxy: enable_mlock 🚀esams [puppet] - 10https://gerrit.wikimedia.org/r/1297164 (owner: 10CDanis) [15:01:53] (03CR) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [15:02:15] (03CR) 10Btullis: [C:03+1] Define the kafka-ui internal DNS records [dns] - 10https://gerrit.wikimedia.org/r/1297146 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [15:03:08] (03CR) 10Btullis: [C:03+1] "nit: commit message still refers to eqiad only." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297150 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [15:03:22] PROBLEM - Host thanos-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:00] (03CR) 10Kamila Součková: [C:03+1] deployment-server: Symlink clusterinfo for cluster_alias [puppet] - 10https://gerrit.wikimedia.org/r/1297101 (https://phabricator.wikimedia.org/T388969) (owner: 10Clément Goubert) [15:04:01] (03PS2) 10Urbanecm: [Growth] wikidatawiki: Enable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) [15:04:13] (03CR) 10Urbanecm: "Good point. Updated." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: 10Urbanecm) [15:06:10] (03PS3) 10Brouberol: dse-k8s: define the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297150 (https://phabricator.wikimedia.org/T428053) [15:07:52] RECOVERY - Host thanos-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:09:46] PROBLEM - Host thanos-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:46] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:10:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P93726 and previous config saved to /var/cache/conftool/dbconfig/20260603-151055-fceratto.json [15:16:03] (03CR) 10Brouberol: [C:03+2] Define the kafka-ui internal DNS records [dns] - 10https://gerrit.wikimedia.org/r/1297146 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [15:16:14] (03CR) 10Brouberol: [C:03+2] Define the kafka.w.o public record [dns] - 10https://gerrit.wikimedia.org/r/1297148 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) 
[15:16:33] !log brouberol@dns1004 START - running authdns-update [15:16:51] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:17:07] RECOVERY - Host thanos-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:18:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:18:04] !log brouberol@dns1004 END - running authdns-update [15:18:10] FIRING: BFDdown: BFD session down between cr1-drmrs and fe80::8618:88ff:fe0d:dc64 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:18:49] (03CR) 10Brouberol: [C:03+2] dse-k8s: define the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297150 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [15:19:00] (03CR) 10Brouberol: [C:03+2] kubernetes/dse-k8s: Define a kafka-ui kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1297144 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [15:19:05] please expect some scheduled CI downtime for max the next 40 min (jenkins switchover, planned) [15:19:42] (03PS2) 10Ayounsi: Overide CertAlmostExpired for network devices [alerts] - 10https://gerrit.wikimedia.org/r/1297163 [15:20:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2012 to codfw - jhancock@cumin2002" [15:21:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T426633)', diff saved to https://phabricator.wikimedia.org/P93727 and previous config saved to /var/cache/conftool/dbconfig/20260603-152102-fceratto.json [15:21:04] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START 
helmfile.d/admin 'apply'. [15:21:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2012 to codfw - jhancock@cumin2002" [15:21:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:21:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:21:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T426633)', diff saved to https://phabricator.wikimedia.org/P93728 and previous config saved to /var/cache/conftool/dbconfig/20260603-152129-fceratto.json [15:21:45] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:22:46] (03CR) 10Dzahn: [C:03+2] contint: disable jenkins on legacy CI hosts [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [15:22:47] (03PS1) 10Eevans: linked-artifacts: fix broken external-service names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297172 (https://phabricator.wikimedia.org/T414140) [15:23:10] RESOLVED: BFDdown: BFD session down between cr1-drmrs and fe80::8618:88ff:fe0d:dc64 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:23:35] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2012 [15:23:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2012 [15:23:41] !log disabling jenkins on CI servers for maintenance [15:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:03] (03CR) 10Michael Große: [C:03+1] "To me, it looks ready to 
be deployed when the time comes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: 10Urbanecm) [15:24:11] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host thanos-be1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:24:55] (03PS1) 10Harroyo-wmf: hCaptcha: Enable risk-score collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) [15:25:23] (03CR) 10CDanis: [C:03+2] cache: haproxy: enable_mlock 🚀esams [puppet] - 10https://gerrit.wikimedia.org/r/1297164 (owner: 10CDanis) [15:25:29] PROBLEM - Host thanos-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2012 [15:25:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2012 [15:25:45] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2012 [15:25:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2012 [15:26:23] (03PS2) 10Harroyo-wmf: hCaptcha: Enable risk-score collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) [15:26:38] (03CR) 10Eevans: [C:03+2] linked-artifacts: fix broken external-service names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297172 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [15:26:52] (03PS3) 10Harroyo-wmf: hCaptcha: Enable risk-score collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) [15:28:29] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war 
https://wikitech.wikimedia.org/wiki/Jenkins [15:28:33] (03PS4) 10Harroyo-wmf: hCaptcha: Enable risk-score collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) [15:28:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T426633)', diff saved to https://phabricator.wikimedia.org/P93729 and previous config saved to /var/cache/conftool/dbconfig/20260603-152836-fceratto.json [15:30:31] RECOVERY - Host thanos-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [15:31:32] (03CR) 10Dzahn: [C:03+2] integration: switch integration-agent-docker VMs to Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [15:32:26] (03PS5) 10Kosta Harlan: hCaptcha: Enable risk-score collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) (owner: 10Harroyo-wmf) [15:32:37] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:33:15] ^ scheduled maintenance [15:33:23] PROBLEM - Host thanos-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:29] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:38:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P93731 and previous config saved to /var/cache/conftool/dbconfig/20260603-153844-fceratto.json [15:38:47] (03CR) 10Eevans: [V:03+2 C:03+2] linked-artifacts: fix broken external-service names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297172 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [15:39:30] jouncebot: nowandnext [15:39:30] 
For the next 0 hour(s) and 20 minute(s): Jenkins switchover/upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1500) [15:39:30] In 1 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1700) [15:39:49] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/linked-artifacts: apply [15:39:51] RECOVERY - Host thanos-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [15:40:02] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linked-artifacts: apply [15:40:17] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/linked-artifacts: apply [15:40:31] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/linked-artifacts: apply [15:40:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:42:01] (03CR) 10Kosta Harlan: [C:03+1] hCaptcha: Enable risk-score collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) (owner: 10Harroyo-wmf) [15:45:14] (03CR) 10Dzahn: [C:03+2] ci: switch jenkins proxy target to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [15:46:14] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2012.wikimedia.org with OS trixie [15:46:48] (03CR) 10Majavah: [V:03+1] confd: Add condition to prevent starting without configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [15:46:48] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host thanos-be1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:47:37] RECOVERY - 
jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:48:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P93732 and previous config saved to /var/cache/conftool/dbconfig/20260603-154852-fceratto.json [15:49:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:49:23] PROBLEM - Host thanos-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:31] (03PS1) 10Elukey: WIP: upgrade to Trixie [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297184 [15:49:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:51:26] (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [15:52:29] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:52:36] (03PS2) 10Elukey: WIP: upgrade to Trixie [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297184 [15:52:51] RECOVERY - Host thanos-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [15:54:35] PROBLEM - Host thanos-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:43] (03PS1) 10Dzahn: ci: update port for jenkins to 1443, using envoy now [puppet] - 10https://gerrit.wikimedia.org/r/1297185 (https://phabricator.wikimedia.org/T418521) [15:58:04] (03CR) 10CI reject: [V:04-1] ci: update port for jenkins to 1443, using envoy now [puppet] - 10https://gerrit.wikimedia.org/r/1297185 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [15:59:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P93733 and previous config saved to /var/cache/conftool/dbconfig/20260603-155859-fceratto.json [15:59:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [15:59:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T426633)', diff saved to https://phabricator.wikimedia.org/P93734 and previous config saved to /var/cache/conftool/dbconfig/20260603-155928-fceratto.json [15:59:48] (03PS2) 10Dzahn: ci: update port for jenkins to 1443, using envoy now [puppet] - 10https://gerrit.wikimedia.org/r/1297185 (https://phabricator.wikimedia.org/T418521) [16:00:07] (03CR) 10Dzahn: [C:03+2] ci: update port for jenkins to 1443, using envoy now [puppet] - 10https://gerrit.wikimedia.org/r/1297185 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:00:20] (03CR) 10Dzahn: [V:03+2 C:03+2] ci: update port for jenkins to 1443, using envoy now [puppet] - 10https://gerrit.wikimedia.org/r/1297185 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:02:51] RECOVERY - Host thanos-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:04:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:04:29] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [16:05:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by btullis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [16:06:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T426633)', diff saved to https://phabricator.wikimedia.org/P93735 and previous 
config saved to /var/cache/conftool/dbconfig/20260603-160635-fceratto.json [16:07:29] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [16:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:01] (03PS1) 10JMeybohm: reuse-parts.sh: Allow to reuse swap with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1297186 (https://phabricator.wikimedia.org/T428078) [16:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:14:57] (03PS3) 10Elukey: WIP: upgrade to Trixie [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297184 [16:16:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P93736 and previous config saved to /var/cache/conftool/dbconfig/20260603-161643-fceratto.json [16:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:17:58] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/services/wdqs: apply [16:19:11] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:23:21] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:23:35] (03PS1) 10Dzahn: Revert "ci: switch jenkins proxy target to new discovery name" [puppet] - 10https://gerrit.wikimedia.org/r/1297190 [16:23:59] (03CR) 10CI reject: [V:04-1] Revert "ci: switch jenkins proxy target to new discovery name" [puppet] - 10https://gerrit.wikimedia.org/r/1297190 (owner: 10Dzahn) [16:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:24:43] (03PS5) 10Kamila Součková: admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) [16:25:10] FIRING: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:25:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:25:57] (03PS1) 10Kamila Součková: admin: update ssh key for migr [puppet] - 10https://gerrit.wikimedia.org/r/1297191 (https://phabricator.wikimedia.org/T428037) [16:26:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P93737 and previous config saved to 
/var/cache/conftool/dbconfig/20260603-162650-fceratto.json [16:28:16] (03PS1) 10Dzahn: ci: revert back to jenkins on existing host [puppet] - 10https://gerrit.wikimedia.org/r/1297192 (https://phabricator.wikimedia.org/T418521) [16:29:01] (03PS5) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [16:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:29:30] (03CR) 10Dzahn: [V:03+2 C:03+2] ci: revert back to jenkins on existing host [puppet] - 10https://gerrit.wikimedia.org/r/1297192 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:30:00] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11982289 (10VRiley-WMF) [16:31:09] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11982295 (10VRiley-WMF) [16:33:36] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:33:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:03] (03PS1) 10Dzahn: Revert "integration: switch integration-agent-docker VMs to Java 21" [puppet] - 10https://gerrit.wikimedia.org/r/1297193 [16:34:11] (03CR) 10Dzahn: [V:03+2 C:03+2] Revert "integration: switch integration-agent-docker VMs to Java 21" [puppet] - 
10https://gerrit.wikimedia.org/r/1297193 (owner: 10Dzahn) [16:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:35:29] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [16:35:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:18] (03PS6) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [16:36:48] (03Merged) 10jenkins-bot: Declare the webrequest.dumps.dev0 stream in EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295922 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [16:36:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T426633)', diff saved to https://phabricator.wikimedia.org/P93738 and previous config saved to /var/cache/conftool/dbconfig/20260603-163658-fceratto.json [16:37:16] !log btullis@deploy1003 Started scap sync-world: Backport for [[gerrit:1295922|Declare the webrequest.dumps.dev0 stream in EventStreamConfig (T291645 T425087)]] [16:37:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [16:37:21] T291645: Produce ECS formatted logstash logs 
to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645 [16:37:22] T425087: Send JSON access logs for dumps.wikimedia.org to Kafka - https://phabricator.wikimedia.org/T425087 [16:37:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T426633)', diff saved to https://phabricator.wikimedia.org/P93739 and previous config saved to /var/cache/conftool/dbconfig/20260603-163726-fceratto.json [16:38:44] ayounsi@cumin1003 reimage (PID 895660) is awaiting input [16:39:16] !log btullis@deploy1003 btullis: Backport for [[gerrit:1295922|Declare the webrequest.dumps.dev0 stream in EventStreamConfig (T291645 T425087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:40:10] RESOLVED: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:40:20] !log btullis@deploy1003 btullis: Continuing with deployment [16:41:29] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296687 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [16:41:59] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:42:19] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:43:28] !log trueg@deploy1003 
helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:43:39] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:44:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T426633)', diff saved to https://phabricator.wikimedia.org/P93740 and previous config saved to /var/cache/conftool/dbconfig/20260603-164428-fceratto.json [16:44:32] !log btullis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295922|Declare the webrequest.dumps.dev0 stream in EventStreamConfig (T291645 T425087)]] (duration: 07m 16s) [16:44:37] T291645: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645 [16:44:38] T425087: Send JSON access logs for dumps.wikimedia.org to Kafka - https://phabricator.wikimedia.org/T425087 [16:46:32] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:47:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:48:20] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:48:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:48:57] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:50:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:51:08] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:51:15] !log 
trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:51:31] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:51:38] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:51:54] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:52:01] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:52:16] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:52:24] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:52:38] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:53:36] !log Restarting CI Jenkins one last time # T418521 [16:53:36] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:53:39] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Beta cluster haproxy does not support `warn-blocked-traffic-after` keyword - https://phabricator.wikimedia.org/T428052#11982409 (10ssingh) >>! In T428052#11982183, @bd808 wrote: > The Beta Cluster cache nodes are Debian Bullseye running HAProxy version 2.8.18-... 
[16:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:39] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [16:54:00] 10ops-codfw, 06SRE, 06DC-Ops: codfw: move public baremetal servers to per rack vlan - https://phabricator.wikimedia.org/T428060#11982415 (10Ladsgroup) mailman is now owned by #collaboration-services ! [16:54:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P93741 and previous config saved to /var/cache/conftool/dbconfig/20260603-165436-fceratto.json [16:54:41] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:58:42] (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297136 (owner: 10Scott French) [17:00:05] swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1700). 
[17:00:12] o/ [17:00:20] I'll be getting started shortly [17:01:11] !oncall-now [17:01:11] Oncall now for team SRE, rotation 247_policy: [17:01:11] t.appof, j.hathaway [17:01:27] (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297136 (owner: 10Scott French) [17:03:19] !log swfrench@deploy1003 Started scap sync-world: No-deploy scap run to verify clean state before config change [17:03:48] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11982491 (10Ladsgroup) File: {T428086} [17:04:03] !log swfrench@deploy1003 Stopping before sync operations [17:04:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P93742 and previous config saved to /var/cache/conftool/dbconfig/20260603-170444-fceratto.json [17:05:09] (03CR) 10Scott French: [C:03+2] scap.cfg.erb: Temporarily pin mediawiki_runtime_image [puppet] - 10https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200) (owner: 10Scott French) [17:06:15] (03PS1) 10Kevin Bazira: ml-services: add cope-b-a4b isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297199 (https://phabricator.wikimedia.org/T427497) [17:08:53] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:09:16] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:09:17] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:09:28] !log swfrench@deploy1003 
helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:09:29] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:09:30] (03PS1) 10Kosta Harlan: hCaptcha: Render a fresh mobile widget for each captcha attempt [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297200 (https://phabricator.wikimedia.org/T425929) [17:09:41] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:09:43] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:09:43] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2012.wikimedia.org with OS trixie [17:09:55] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:09:56] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:10:13] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:10:14] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:10:36] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:12:46] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:12:48] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11982519 (10Nemoralis) I think one of 
the questions that needs to be asked here is, are many of these files really needed? Some may have education... [17:13:24] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:13:56] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:14:26] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:14:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T426633)', diff saved to https://phabricator.wikimedia.org/P93743 and previous config saved to /var/cache/conftool/dbconfig/20260603-171452-fceratto.json [17:14:58] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:15:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1253.eqiad.wmnet with reason: Maintenance [17:15:14] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:15:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T426633)', diff saved to https://phabricator.wikimedia.org/P93744 and previous config saved to /var/cache/conftool/dbconfig/20260603-171521-fceratto.json [17:15:45] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:16:10] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:16:18] (03PS1) 10JMeybohm: reuse-raid10-6dev.cfg: Fix swap reuse and grub-install on all disks [puppet] - 10https://gerrit.wikimedia.org/r/1297201 (https://phabricator.wikimedia.org/T428078) [17:16:41] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:17:04] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:17:04] !log swfrench@deploy1003 Started scap sync-world: No-deploy scap run to verify scap config change [17:17:15] 
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:17:35] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:17:48] !log swfrench@deploy1003 Stopping before sync operations [17:18:36] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:19:37] (03PS2) 10Jcrespo: backup: Add job ids for read-only backups [puppet] - 10https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661) [17:23:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T426633)', diff saved to https://phabricator.wikimedia.org/P93745 and previous config saved to /var/cache/conftool/dbconfig/20260603-172319-fceratto.json [17:23:59] (03CR) 10JMeybohm: "Currently I have only tested this with trixie." [puppet] - 10https://gerrit.wikimedia.org/r/1297186 (https://phabricator.wikimedia.org/T428078) (owner: 10JMeybohm) [17:26:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:29:59] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5032.* [17:31:02] (03CR) 10RLazarus: [C:04-1] "We should be building these on our own infra, whether at CI time or in the image. 
Vendoring the code in and building in CI is one way to d" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427989) (owner: 10Jforrester) [17:32:31] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:33:15] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:33:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P93746 and previous config saved to /var/cache/conftool/dbconfig/20260603-173327-fceratto.json [17:33:46] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:34:20] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:34:51] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:35:09] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:35:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:35:58] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:36:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:36:57] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:37:28] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:37:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:38:32] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: add cope-b-a4b isvc to experimental ns (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297199 
(https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [17:38:34] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:43:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P93747 and previous config saved to /var/cache/conftool/dbconfig/20260603-174334-fceratto.json [17:45:49] * swfrench-wmf is done with the infra window [17:52:19] !log contint1003: sudo puppet agent --disable "Prevent Jenkins from coming back" [17:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:37] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:53:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T426633)', diff saved to https://phabricator.wikimedia.org/P93748 and previous config saved to /var/cache/conftool/dbconfig/20260603-175342-fceratto.json [17:55:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [17:55:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T426633)', diff saved to https://phabricator.wikimedia.org/P93749 and previous config saved to /var/cache/conftool/dbconfig/20260603-175544-fceratto.json [17:57:57] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5032.* [18:00:05] dancy and jnuche: MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T1800). Please do the needful. [18:00:09] o/ [18:00:28] hashar: Is CI in a healthy state? 
[18:00:38] hopefully yeah [18:00:44] we did a migration, rolled it back [18:01:03] so we are back to the previous state and I think it is all fine now [18:01:10] oh [18:01:13] that is for running the train! [18:01:18] Yes [18:01:37] if CI managed to merge changes to mediawiki*, then it is fine [18:01:47] and it did https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1293728 [18:01:53] so yeah +1 on doing train [18:01:55] I am around still [18:02:02] Alright. Pressing the button! Thanks hashar [18:02:26] +1 ;) [18:02:40] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297207 (https://phabricator.wikimedia.org/T423914) [18:02:43] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297207 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:03:46] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297207 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:04:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T426633)', diff saved to https://phabricator.wikimedia.org/P93750 and previous config saved to /var/cache/conftool/dbconfig/20260603-180404-fceratto.json [18:08:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqsin and NTT (116.51.26.209) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:09:45] (03CR) 10Ssingh: "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) (owner: 10SBassett) [18:10:05] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.5 refs T423914 [18:10:10] T423914: 1.47.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T423914 [18:14:13]
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P93751 and previous config saved to /var/cache/conftool/dbconfig/20260603-181412-fceratto.json [18:17:03] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11982659 (10Ladsgroup) >>! In T427949#11982519, @Nemoralis wrote: > I think one of the questions that needs to be asked here is, are many of these... [18:19:12] (03CR) 10Ssingh: [C:03+2] varnish: Add CSP report-only directives for all of upload.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) (owner: 10SBassett) [18:24:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P93752 and previous config saved to /var/cache/conftool/dbconfig/20260603-182420-fceratto.json [18:30:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/0 (Transit: NTT (284967) {#1119}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:34:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T426633)', diff saved to https://phabricator.wikimedia.org/P93753 and previous config saved to /var/cache/conftool/dbconfig/20260603-183427-fceratto.json [18:34:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [18:34:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T426633)', diff saved to https://phabricator.wikimedia.org/P93754 and previous config saved to 
/var/cache/conftool/dbconfig/20260603-183455-fceratto.json [18:35:16] (03PS1) 10BCornwall: Remove digicert CAA records from most domains [dns] - 10https://gerrit.wikimedia.org/r/1297210 [18:40:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/0 (Transit: NTT (284967) {#1119}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:41:34] (03CR) 10Ssingh: "Looks good to me at least. Will need the input of fr-tech which I know you are reaching out to, so once they approve, let's merge." [dns] - 10https://gerrit.wikimedia.org/r/1297210 (owner: 10BCornwall) [18:43:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T426633)', diff saved to https://phabricator.wikimedia.org/P93755 and previous config saved to /var/cache/conftool/dbconfig/20260603-184324-fceratto.json [18:43:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-eqsin and NTT (116.51.26.209) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:45:10] (03PS1) 10CDanis: cache: haproxy: enable_mlock 🚀ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1297212 [18:50:24] (03PS2) 10BCornwall: Remove digicert CAA records from most domains [dns] - 10https://gerrit.wikimedia.org/r/1297210 (https://phabricator.wikimedia.org/T428093) [18:51:11] (03PS1) 10Cwhite: logstash: route all access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1297214 (https://phabricator.wikimedia.org/T291645) [18:53:06] (03PS1) 10Dzahn: contint: switch apache proxying to jenkins to use https [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) [18:53:22] (03CR) 10CI reject: [V:04-1] 
logstash: route all access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1297214 (https://phabricator.wikimedia.org/T291645) (owner: 10Cwhite) [18:53:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P93756 and previous config saved to /var/cache/conftool/dbconfig/20260603-185331-fceratto.json [18:53:53] (03PS1) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [18:54:13] (03CR) 10CI reject: [V:04-1] varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [18:55:16] (03PS2) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [18:56:18] (03CR) 10Dzahn: "https://httpd.apache.org/docs/2.4/mod/mod_ssl.html#sslproxyengine" [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:59:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:59:46] (03CR) 10Ssingh: [C:03+1] cache: haproxy: enable_mlock 🚀ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1297212 (owner: 10CDanis) [18:59:51] (03CR) 10CDanis: [C:03+2] cache: haproxy: enable_mlock 🚀ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1297212 (owner: 10CDanis) [19:03:14] 06SRE, 10SRE-Access-Requests: Requesting access to Cassandra staging for akhatun - https://phabricator.wikimedia.org/T427701#11982867 (10Ahoelzl) Approved. 
[19:03:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P93757 and previous config saved to /var/cache/conftool/dbconfig/20260603-190340-fceratto.json [19:03:53] (03CR) 10Volans: confd: Add condition to prevent starting without configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [19:13:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T426633)', diff saved to https://phabricator.wikimedia.org/P93758 and previous config saved to /var/cache/conftool/dbconfig/20260603-191348-fceratto.json [19:14:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [19:14:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1024-1025].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:14:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T426633)', diff saved to https://phabricator.wikimedia.org/P93759 and previous config saved to /var/cache/conftool/dbconfig/20260603-191437-fceratto.json [19:14:50] (03PS1) 10Eevans: linked-artifacts: upgrade to v1.4.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297219 (https://phabricator.wikimedia.org/T414140) [19:18:04] (03CR) 10Eevans: [C:03+2] linked-artifacts: upgrade to v1.4.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297219 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [19:20:14] (03Merged) 10jenkins-bot: linked-artifacts: upgrade to v1.4.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297219 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [19:20:55] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers 
for the new subnets - https://phabricator.wikimedia.org/T427393#11982896 (10BCornwall) When pooling cp5032 I noticed that connection to kafka-jumbo1016.eqiad.wmnet:9093 (`10.64.154.15 via 10.132.1.1 de... [19:22:05] (03CR) 10JHathaway: "I think that makes sense, but I worry we won't get back to it." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [19:22:07] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [19:22:44] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [19:22:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T426633)', diff saved to https://phabricator.wikimedia.org/P93760 and previous config saved to /var/cache/conftool/dbconfig/20260603-192250-fceratto.json [19:25:15] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/linked-artifacts: apply [19:25:35] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linked-artifacts: apply [19:25:58] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/linked-artifacts: apply [19:26:20] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/linked-artifacts: apply [19:28:44] (03PS1) 10BCornwall: common: Update cp5032 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1297221 (https://phabricator.wikimedia.org/T427393) [19:30:33] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8640/console" [puppet] - 10https://gerrit.wikimedia.org/r/1297221 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [19:31:36] (03CR) 10CDobbins: [C:03+1] common: Update cp5032 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1297221 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [19:31:45] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 
10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11982927 (10Andrew) > I just can't download anything without getting 429s, and on my own laptop it ooms a lot given the size of these files. Runni... [19:32:23] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8641/co" [puppet] - 10https://gerrit.wikimedia.org/r/1297221 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [19:32:38] (03CR) 10BCornwall: [V:03+1 C:03+2] common: Update cp5032 IP address [puppet] - 10https://gerrit.wikimedia.org/r/1297221 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [19:32:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P93761 and previous config saved to /var/cache/conftool/dbconfig/20260603-193258-fceratto.json [19:34:23] (03PS2) 10Cwhite: logstash: route all access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1297214 (https://phabricator.wikimedia.org/T291645) [19:35:22] (03CR) 10Bartosz Dziewoński: "Also removed from global groups:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [19:37:23] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11982956 (10BCornwall) @ayounsi helpfully pointed out that I needed to update hieradata/common.yaml with the new IP addresses. Thanks! 
[19:37:48] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5032.* [19:39:28] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5032.* [19:41:47] (03PS1) 10BCornwall: common: Fix cp5032 IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/1297222 (https://phabricator.wikimedia.org/T427393) [19:42:13] (03PS3) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [19:43:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P93762 and previous config saved to /var/cache/conftool/dbconfig/20260603-194306-fceratto.json [19:44:52] (03CR) 10CDobbins: [C:03+1] common: Fix cp5032 IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/1297222 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [19:45:04] (03CR) 10BCornwall: [C:03+2] common: Fix cp5032 IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/1297222 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [19:45:19] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11982982 (10Ladsgroup) Creating a ticket to request a temporary cloud VPS project for it is in my todo list for today. I hope I can get to it ASAP... 
[19:47:38] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5032.* [19:50:37] (03CR) 10Cwhite: [C:03+2] logstash: route all access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1297214 (https://phabricator.wikimedia.org/T291645) (owner: 10Cwhite) [19:52:09] (03PS1) 10Eevans: linked-artifacts: egress to inference service (production) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297224 (https://phabricator.wikimedia.org/T414140) [19:53:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T426633)', diff saved to https://phabricator.wikimedia.org/P93763 and previous config saved to /var/cache/conftool/dbconfig/20260603-195313-fceratto.json [19:53:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [19:53:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T426633)', diff saved to https://phabricator.wikimedia.org/P93764 and previous config saved to /var/cache/conftool/dbconfig/20260603-195341-fceratto.json [19:55:15] (03CR) 10Eevans: [C:03+2] linked-artifacts: egress to inference service (production) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297224 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [19:57:34] (03Merged) 10jenkins-bot: linked-artifacts: egress to inference service (production) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297224 (https://phabricator.wikimedia.org/T414140) (owner: 10Eevans) [19:59:19] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/linked-artifacts: apply [19:59:25] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linked-artifacts: apply [19:59:38] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/linked-artifacts: apply [19:59:45] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/linked-artifacts: apply [20:00:18] (03PS2) 
10Jforrester: abstractwiki-rust: Add rust-clippy and clang to the toolchain [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427998) [20:02:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T426633)', diff saved to https://phabricator.wikimedia.org/P93765 and previous config saved to /var/cache/conftool/dbconfig/20260603-200203-fceratto.json [20:12:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P93766 and previous config saved to /var/cache/conftool/dbconfig/20260603-201211-fceratto.json [20:16:27] ok if i backport something now? [20:18:06] jouncebot: nowandnext [20:18:06] For the next 0 hour(s) and 41 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T2000) [20:18:06] In 0 hour(s) and 41 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T2100) [20:18:15] cjming: I think go ahead, no one else seems to be using it [20:18:30] cool - just waiting for something to merge and i'll add it to the deployment cal [20:18:41] (03PS2) 10Scott French: php8.3: Rebuild 8.3 image stack on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295044 (https://phabricator.wikimedia.org/T418200) [20:19:35] (03CR) 10Scott French: [V:03+2] "Built locally: https://phabricator.wikimedia.org/T427312#11964352" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1295044 (https://phabricator.wikimedia.org/T418200) (owner: 10Scott French) [20:21:38] (03PS1) 10Clare Ming: Attribution research don't use testKitchen compatibility layer [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297228 (https://phabricator.wikimedia.org/T417050) [20:22:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Wednesday, June 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297228 (https://phabricator.wikimedia.org/T417050) (owner: 10Clare Ming) [20:22:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P93769 and previous config saved to /var/cache/conftool/dbconfig/20260603-202219-fceratto.json [20:22:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297228 (https://phabricator.wikimedia.org/T417050) (owner: 10Clare Ming) [20:26:31] (03Merged) 10jenkins-bot: Attribution research don't use testKitchen compatibility layer [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297228 (https://phabricator.wikimedia.org/T417050) (owner: 10Clare Ming) [20:26:57] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1297228|Attribution research don't use testKitchen compatibility layer (T417050)]] [20:27:01] T417050: Attribution Research: Instrument pageviews - https://phabricator.wikimedia.org/T417050 [20:29:07] !log cjming@deploy1003 cjming: Backport for [[gerrit:1297228|Attribution research don't use testKitchen compatibility layer (T417050)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:29:29] !log cjming@deploy1003 cjming: Continuing with deployment [20:32:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T426633)', diff saved to https://phabricator.wikimedia.org/P93770 and previous config saved to /var/cache/conftool/dbconfig/20260603-203227-fceratto.json [20:32:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [20:32:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T426633)', diff saved to https://phabricator.wikimedia.org/P93771 and previous config saved to /var/cache/conftool/dbconfig/20260603-203254-fceratto.json [20:33:38] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297228|Attribution research don't use testKitchen compatibility layer (T417050)]] (duration: 06m 41s) [20:33:42] T417050: Attribution Research: Instrument pageviews - https://phabricator.wikimedia.org/T417050 [20:34:24] cool - all done [20:38:04] (03CR) 10CDobbins: "Hey, I noticed that `report-uri` has been deprecated (https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Content-Security" [puppet] - 10https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) (owner: 10SBassett) [20:41:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T426633)', diff saved to https://phabricator.wikimedia.org/P93772 and previous config saved to /var/cache/conftool/dbconfig/20260603-204115-fceratto.json [20:45:05] (03CR) 10SBassett: "Yes, that should be fine. We recently implemented report-to in MediaWiki's core's CSP implementation: I5195fa9b3. 
Though technically we'" [puppet] - 10https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) (owner: 10SBassett) [20:48:44] (03PS1) 10Cathal Mooney: sre.hosts: Add eqsin old names to LEGACY_VLANS to support move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) [20:50:05] (03PS2) 10Cathal Mooney: sre.hosts: Add eqsin old names to LEGACY_VLANS to support move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) [20:51:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P93773 and previous config saved to /var/cache/conftool/dbconfig/20260603-205122-fceratto.json [20:55:54] (03PS3) 10Jforrester: abstractwiki-rust: Add rust-clippy and clang to the toolchain [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427998) [20:55:54] (03CR) 10Jforrester: abstractwiki-rust: Add rust-clippy and clang to the toolchain (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427998) (owner: 10Jforrester) [20:55:54] (03PS1) 10Jforrester: abstractwiki-rust: Bake in cargo-chef, built offline from vendored sources [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) [20:59:48] (03CR) 10Hashar: contint: switch apache proxying to jenkins to use https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:59:51] (03CR) 10Jforrester: abstractwiki-rust: Add rust-clippy and clang to the toolchain (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427998) (owner: 10Jforrester) [21:00:04] Deploy window Wikifunctions Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T2100) [21:01:25] (03CR) 10Jforrester: "@rzl: Was this what you were thinking for offline/vendored build? It's a bit big, but this does indeed produce a static, offline build tha" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) (owner: 10Jforrester) [21:01:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P93774 and previous config saved to /var/cache/conftool/dbconfig/20260603-210130-fceratto.json [21:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:11:04] (03PS2) 10Dzahn: contint: switch apache proxying to jenkins to use https [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) [21:11:04] (03CR) 10Dzahn: contint: switch apache proxying to jenkins to use https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:11:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T426633)', diff saved to https://phabricator.wikimedia.org/P93778 and previous config saved to /var/cache/conftool/dbconfig/20260603-211138-fceratto.json [21:11:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [21:12:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T426633)', diff saved to https://phabricator.wikimedia.org/P93779 and previous config saved to /var/cache/conftool/dbconfig/20260603-211206-fceratto.json [21:15:35] (03PS1) 10Dzahn: jenkins: ensure service is absent 
on new jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) [21:20:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T426633)', diff saved to https://phabricator.wikimedia.org/P93781 and previous config saved to /var/cache/conftool/dbconfig/20260603-212030-fceratto.json [21:22:45] (03PS1) 10BCornwall: wmf-config: Add new private1-eqsin subnets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) [21:26:10] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11983173 (10BCornwall) [21:26:55] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11983180 (10BCornwall) I was advised by @taavi to also update mediawiki-config's `wmf-config/reverse-proxy.php` ranges. I've updated th... 
[21:27:34] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11983181 (10BCornwall) [21:30:31] (03CR) 10Scott French: confd: Add condition to prevent starting without configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [21:30:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P93782 and previous config saved to /var/cache/conftool/dbconfig/20260603-213038-fceratto.json [21:32:31] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:35:22] (03CR) 10RLazarus: "Not with the source tree directly in this repo, no. If you look at the README and see what we do for Go binaries, you can do something ana" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) (owner: 10Jforrester) [21:37:31] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:40:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P93783 and previous config saved to /var/cache/conftool/dbconfig/20260603-214046-fceratto.json [21:50:04] (03PS6) 10Dreamy Jazz: hCaptcha: Enable risk-score collection for users blocked by IP blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) (owner: 10Harroyo-wmf) [21:50:39] (03CR) 10Dreamy Jazz: [C:03+1] "Probably want to carefully watch for any issues during and after the deploy if we are not letting QA have a look first" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) (owner: 10Harroyo-wmf) [21:50:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T426633)', diff saved to https://phabricator.wikimedia.org/P93784 and previous config saved to /var/cache/conftool/dbconfig/20260603-215053-fceratto.json [21:51:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance [21:51:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T426633)', diff saved to https://phabricator.wikimedia.org/P93785 and previous config saved to /var/cache/conftool/dbconfig/20260603-215110-fceratto.json [21:51:16] (03CR) 10Dzahn: [C:04-2] jenkins: ensure service is absent on new jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:52:31] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:57:31] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:59:36] (03CR) 10Scott French: "Thanks, Reuven!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T2200) [22:00:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T426633)', diff saved to https://phabricator.wikimedia.org/P93786 and previous config saved to /var/cache/conftool/dbconfig/20260603-220026-fceratto.json [22:10:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P93787 and previous config saved to /var/cache/conftool/dbconfig/20260603-221034-fceratto.json [22:16:42] (03PS2) 10Dzahn: jenkins: ensure service is absent on new jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) [22:17:03] (03CR) 10Dzahn: jenkins: ensure service is absent on new jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [22:17:36] (03CR) 10Dzahn: "since this is in Hiera on level of "role::jenkins" (NOT role::ci) it will only affect both new hosts and not the old hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [22:18:20] (03CR) 10Dzahn: [C:03+2] site: add releases[12]004 with collab insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1296687 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [22:18:44] (03CR) 10Scott French: [C:03+1] mesh.configuration: Add restricted_listeners (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:20:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P93788 and previous config saved to 
/var/cache/conftool/dbconfig/20260603-222041-fceratto.json [22:25:17] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11983299 (10Ladsgroup) {T428102} to run the bot [22:30:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T426633)', diff saved to https://phabricator.wikimedia.org/P93789 and previous config saved to /var/cache/conftool/dbconfig/20260603-223048-fceratto.json [22:31:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [22:31:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T426633)', diff saved to https://phabricator.wikimedia.org/P93790 and previous config saved to /var/cache/conftool/dbconfig/20260603-223116-fceratto.json [22:39:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T426633)', diff saved to https://phabricator.wikimedia.org/P93791 and previous config saved to /var/cache/conftool/dbconfig/20260603-223937-fceratto.json [22:41:11] (03PS1) 10Clare Ming: test-kitchen: Update chart to add new config properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297249 (https://phabricator.wikimedia.org/T428017) [22:49:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P93792 and previous config saved to /var/cache/conftool/dbconfig/20260603-224945-fceratto.json [22:52:48] (03CR) 10Scott French: mesh.service: Add TLS service ports for restricted_listeners (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:59:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P93793 and previous config saved to /var/cache/conftool/dbconfig/20260603-225953-fceratto.json [23:02:36] (03PS2) 10Clare Ming: test-kitchen: Update chart to add new config properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297249 (https://phabricator.wikimedia.org/T428017) [23:05:05] (03CR) 10Santiago Faci: [C:03+2] test-kitchen: Update chart to add new config properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297249 (https://phabricator.wikimedia.org/T428017) (owner: 10Clare Ming) [23:06:47] (03CR) 10Scott French: "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [23:06:48] (03CR) 10RLazarus: [V:03+2 C:03+2] "Tested locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427998) (owner: 10Jforrester) [23:06:58] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.4.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297251 (https://phabricator.wikimedia.org/T428017) [23:07:23] (03Merged) 10jenkins-bot: test-kitchen: Update chart to add new config properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297249 (https://phabricator.wikimedia.org/T428017) (owner: 10Clare Ming) [23:07:36] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.4.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297251 (https://phabricator.wikimedia.org/T428017) [23:10:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P93794 and previous config saved to /var/cache/conftool/dbconfig/20260603-231001-fceratto.json [23:10:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [23:10:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T426633)', diff saved to https://phabricator.wikimedia.org/P93795 and previous config saved to /var/cache/conftool/dbconfig/20260603-231031-fceratto.json [23:17:37] RECOVERY - MD RAID on centrallog1002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:18:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T426633)', diff saved to https://phabricator.wikimedia.org/P93796 and previous config saved to /var/cache/conftool/dbconfig/20260603-231844-fceratto.json [23:18:48] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.4.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297251 (https://phabricator.wikimedia.org/T428017) (owner: 10Clare Ming) [23:20:49] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.4.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297251 (https://phabricator.wikimedia.org/T428017) (owner: 10Clare Ming) [23:22:12] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [23:22:30] (03PS1) 10Creynolds: dumps: Clarify download types and refresh HTML dumps references [puppet] - 10https://gerrit.wikimedia.org/r/1297256 [23:22:38] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [23:24:29] (03CR) 10Creynolds: "Post our conversation mention of dumps I felt like a small optimization sweep was in order... clarify some things and nicer UX." 
[puppet] - https://gerrit.wikimedia.org/r/1297256 (owner: Creynolds) [23:28:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P93797 and previous config saved to /var/cache/conftool/dbconfig/20260603-232852-fceratto.json [23:31:13] jouncebot: nowandnext [23:31:13] No deployments scheduled for the next 6 hour(s) and 28 minute(s) [23:31:13] In 6 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T0600) [23:31:14] In 6 hour(s) and 28 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T0600) [23:31:42] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/timeline] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296561 (owner: Reedy) [23:31:43] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296560 (owner: Reedy) [23:34:25] (Merged) jenkins-bot: Add a maintenance script to delete old files [extensions/timeline] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296561 (owner: Reedy) [23:34:26] (Merged) jenkins-bot: Add a maintenance script to delete old files [extensions/timeline] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1296560 (owner: Reedy) [23:34:58] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1296561|Add a maintenance script to delete old files]], [[gerrit:1296560|Add a maintenance script to delete old files]] [23:36:54] !log ladsgroup@deploy1003 ladsgroup, reedy: Backport for [[gerrit:1296561|Add a maintenance script to delete old files]], [[gerrit:1296560|Add a maintenance script to delete old files]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [23:37:53] !log ladsgroup@deploy1003 ladsgroup, reedy: Continuing with deployment [23:39:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P93798 and previous config saved to /var/cache/conftool/dbconfig/20260603-233859-fceratto.json [23:39:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1297257 [23:39:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1297257 (owner: TrainBranchBot) [23:42:07] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296561|Add a maintenance script to delete old files]], [[gerrit:1296560|Add a maintenance script to delete old files]] (duration: 07m 09s) [23:47:50] (CR) Catrope: "Please be careful to not make the same mistake we made: `report-to` does not take a URL, it takes a symbolic name of a reporting endpoint."
[puppet] - https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) (owner: SBassett) [23:49:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T426633)', diff saved to https://phabricator.wikimedia.org/P93799 and previous config saved to /var/cache/conftool/dbconfig/20260603-234907-fceratto.json [23:49:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [23:49:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T426633)', diff saved to https://phabricator.wikimedia.org/P93800 and previous config saved to /var/cache/conftool/dbconfig/20260603-234935-fceratto.json [23:51:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1297257 (owner: TrainBranchBot) [23:57:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T426633)', diff saved to https://phabricator.wikimedia.org/P93801 and previous config saved to /var/cache/conftool/dbconfig/20260603-235758-fceratto.json