[00:05:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:18:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:06:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:09:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1292911 [01:09:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1292911 (owner: 10TrainBranchBot) [01:11:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:15:25] FIRING: [2x] SystemdUnitFailed: 
gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:20:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1292911 (owner: 10TrainBranchBot) [02:00:44] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:04:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:27] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 43s) [02:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:40] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:47:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1013.eqiad.wmnet: Maintenance on pc3 [05:17:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:17:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:17:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1013.eqiad.wmnet: Maintenance on pc3 [05:18:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2023.codfw.wmnet,pc[1013,1023].eqiad.wmnet with reason: Maintenance on pc3 [05:20:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:33] (03PS1) 10Marostegui: mariadb: Productionize pc1023 [puppet] - 10https://gerrit.wikimedia.org/r/1292913 (https://phabricator.wikimedia.org/T418973) [05:26:43] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc1023 [puppet] - 10https://gerrit.wikimedia.org/r/1292913 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [06:08:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1027.eqiad.wmnet with reason: Reboot [06:15:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 
on db2233.codfw.wmnet with reason: Reboot upgrade m2 [06:15:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2160.codfw.wmnet with reason: Reboot upgrade m2 [06:17:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [06:17:16] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [06:17:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2233.codfw.wmnet with reason: Reimage to Trixie [06:19:19] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2233.codfw.wmnet with OS trixie [06:21:54] (03PS1) 10KartikMistry: Update cxserver to 2026-05-24-103047-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292915 (https://phabricator.wikimedia.org/T426808) [06:35:40] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2233.codfw.wmnet with reason: host reimage [06:40:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2233.codfw.wmnet with reason: host reimage [06:42:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [06:43:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2186 (T426633)', diff saved to https://phabricator.wikimedia.org/P92828 and previous config saved to /var/cache/conftool/dbconfig/20260525-064305-fceratto.json [06:49:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186 (T426633)', diff saved to https://phabricator.wikimedia.org/P92829 and previous config saved to /var/cache/conftool/dbconfig/20260525-064902-fceratto.json [06:59:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186', diff saved to https://phabricator.wikimedia.org/P92830 and previous config saved to 
/var/cache/conftool/dbconfig/20260525-065909-fceratto.json
[06:59:53] (CR) Elukey: [C:+2] docker_registry: move the /ml prefix to its new S3 backend [puppet] - https://gerrit.wikimedia.org/r/1290808 (https://phabricator.wikimedia.org/T420978) (owner: Elukey)
[07:00:05] Amir1, urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:03:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2233.codfw.wmnet with OS trixie
[07:04:15] (CR) KartikMistry: [C:+1] Article Guidance: enable experiment on phase 2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) (owner: Sbisson)
[07:09:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186', diff saved to https://phabricator.wikimedia.org/P92831 and previous config saved to /var/cache/conftool/dbconfig/20260525-070917-fceratto.json
[07:17:05] SRE-swift-storage, Ceph, Infrastructure-Foundations, Machine-Learning-Team, Patch-For-Review: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11951607 (elukey) Open→Resolved a:elukey Change deployed! I tested a Docker pull and every...
[07:17:20] (03PS16) 10Arnaudb: vrts: add dual SpamAssassin and Rspamd training [puppet] - 10https://gerrit.wikimedia.org/r/1251331 (https://phabricator.wikimedia.org/T402260) [07:17:29] (03CR) 10Arnaudb: [C:03+2] vrts: add dual SpamAssassin and Rspamd training [puppet] - 10https://gerrit.wikimedia.org/r/1251331 (https://phabricator.wikimedia.org/T402260) (owner: 10Arnaudb) [07:19:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186 (T426633)', diff saved to https://phabricator.wikimedia.org/P92832 and previous config saved to /var/cache/conftool/dbconfig/20260525-071924-fceratto.json [07:19:46] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2196.codfw.wmnet with reason: Maintenance [07:19:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2196 (T426633)', diff saved to https://phabricator.wikimedia.org/P92833 and previous config saved to /var/cache/conftool/dbconfig/20260525-071953-fceratto.json [07:26:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196 (T426633)', diff saved to https://phabricator.wikimedia.org/P92834 and previous config saved to /var/cache/conftool/dbconfig/20260525-072645-fceratto.json [07:36:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196', diff saved to https://phabricator.wikimedia.org/P92835 and previous config saved to /var/cache/conftool/dbconfig/20260525-073653-fceratto.json [07:43:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#11951689 (10ABran-WMF) [07:47:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196', diff saved to https://phabricator.wikimedia.org/P92836 and previous config saved to /var/cache/conftool/dbconfig/20260525-074700-fceratto.json [07:47:40] FIRING: [4x] SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:57:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196 (T426633)', diff saved to https://phabricator.wikimedia.org/P92837 and previous config saved to /var/cache/conftool/dbconfig/20260525-075708-fceratto.json [07:57:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: Maintenance [07:57:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2215 (T426633)', diff saved to https://phabricator.wikimedia.org/P92838 and previous config saved to /var/cache/conftool/dbconfig/20260525-075739-fceratto.json [08:04:47] (03PS11) 10Arnaudb: vrts: Create test role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1178874 (https://phabricator.wikimedia.org/T378028) (owner: 10AOkoth) [08:04:47] (03CR) 10Arnaudb: "I rebased the change, it should be now possible to use rspamd for that new role" [puppet] - 10https://gerrit.wikimedia.org/r/1178874 (https://phabricator.wikimedia.org/T378028) (owner: 10AOkoth) [08:04:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215 (T426633)', diff saved to https://phabricator.wikimedia.org/P92839 and previous config saved to /var/cache/conftool/dbconfig/20260525-080448-fceratto.json [08:05:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11951704 (10ABran-WMF) [08:10:14] (03CR) 10Federico Ceratto: sre.mysql.upgrade: support multiinstance hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 
10FNegri) [08:14:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215', diff saved to https://phabricator.wikimedia.org/P92840 and previous config saved to /var/cache/conftool/dbconfig/20260525-081456-fceratto.json [08:25:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215', diff saved to https://phabricator.wikimedia.org/P92841 and previous config saved to /var/cache/conftool/dbconfig/20260525-082504-fceratto.json [08:35:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215 (T426633)', diff saved to https://phabricator.wikimedia.org/P92842 and previous config saved to /var/cache/conftool/dbconfig/20260525-083511-fceratto.json [08:35:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2231.codfw.wmnet with reason: Maintenance [08:35:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2231 (T426633)', diff saved to https://phabricator.wikimedia.org/P92843 and previous config saved to /var/cache/conftool/dbconfig/20260525-083540-fceratto.json [08:38:46] (03PS1) 10Filippo Giunchedi: cloud: scrape zk metrics [puppet] - 10https://gerrit.wikimedia.org/r/1293076 (https://phabricator.wikimedia.org/T422646) [08:41:42] (03CR) 10Filippo Giunchedi: [C:03+2] cloud: scrape zk metrics [puppet] - 10https://gerrit.wikimedia.org/r/1293076 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [08:42:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231 (T426633)', diff saved to https://phabricator.wikimedia.org/P92844 and previous config saved to /var/cache/conftool/dbconfig/20260525-084239-fceratto.json [08:46:26] (03PS1) 10Filippo Giunchedi: prometheus: disambiguate zk cloud prometheus job [puppet] - 10https://gerrit.wikimedia.org/r/1293077 (https://phabricator.wikimedia.org/T422646) [08:48:49] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: disambiguate zk cloud 
prometheus job [puppet] - 10https://gerrit.wikimedia.org/r/1293077 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [08:51:11] (03CR) 10Federico Ceratto: sre.mysql.upgrade: support multiinstance hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [08:52:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231', diff saved to https://phabricator.wikimedia.org/P92845 and previous config saved to /var/cache/conftool/dbconfig/20260525-085247-fceratto.json [09:02:46] (03CR) 10Filippo Giunchedi: [C:03+2] designate: use zk backend in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1287822 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [09:02:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231', diff saved to https://phabricator.wikimedia.org/P92846 and previous config saved to /var/cache/conftool/dbconfig/20260525-090255-fceratto.json [09:02:58] (03PS2) 10Filippo Giunchedi: designate: use zk backend in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1287822 (https://phabricator.wikimedia.org/T422646) [09:03:04] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] designate: use zk backend in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1287822 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [09:08:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#11951905 (10ayounsi) FYI there is a rename cookbook: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging [09:12:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:13:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231 
(T426633)', diff saved to https://phabricator.wikimedia.org/P92847 and previous config saved to /var/cache/conftool/dbconfig/20260525-091302-fceratto.json [09:13:48] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:15:10] (03PS1) 10Filippo Giunchedi: prometheus: use standard cloud zk jmx_exporter arguments [puppet] - 10https://gerrit.wikimedia.org/r/1293078 (https://phabricator.wikimedia.org/T422646) [09:15:12] (03CR) 10Hnowlan: [C:03+1] performance.w.o: add http blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1291950 (https://phabricator.wikimedia.org/T425299) (owner: 10Tiziano Fogli) [09:17:14] (03PS3) 10Elukey: WIP: sre.hosts.provision: introduce the wmfroot user [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) [09:17:33] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:17:59] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: use standard cloud zk jmx_exporter arguments [puppet] - 10https://gerrit.wikimedia.org/r/1293078 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [09:20:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:26] (03CR) 10Hnowlan: [C:03+2] prometheus, thanos: move recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [09:28:27] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:30:14] (03CR) 10FNegri: 
"@fceratto@wikimedia.org I thought a bit about this over the weekend. Given that we have to extend this test in the follow-up patches (see " [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [09:35:03] (03PS1) 10Filippo Giunchedi: prometheus: add jmx exporter jobs to cloud [puppet] - 10https://gerrit.wikimedia.org/r/1293079 [09:37:11] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: add jmx exporter jobs to cloud [puppet] - 10https://gerrit.wikimedia.org/r/1293079 (owner: 10Filippo Giunchedi) [09:39:37] (03CR) 10Arnaudb: [C:03+1] "looks good to me, the directory is empty on primary. related to this, we recently created https://phabricator.wikimedia.org/T423253" [puppet] - 10https://gerrit.wikimedia.org/r/1193832 (owner: 10Hashar) [09:39:51] (03PS4) 10Elukey: WIP: sre.hosts.provision: introduce the wmfroot user [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) [09:40:10] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:40:35] (03CR) 10Mszwarc: "I see it was done in Ifd40456fbbfc288b9dd5fc4e4b5c951ec52a79c1, thanks!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1182798 (https://phabricator.wikimedia.org/T280532) (owner: Mszwarc)
[09:40:35] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:42:49] (CR) Arnaudb: [C:+2] mailman: add UpstreamTlsContext on tlsproxy::envoy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[09:43:59] (Abandoned) Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb)
[09:45:51] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:46:24] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:48:29] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:49:14] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1006.eqiad.wmnet
[09:57:49] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:59:18] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1006.eqiad.wmnet
[09:59:22] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.eqiad.wmnet
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1000)
[10:01:36] (CR) Tiziano Fogli: [C:+2] performance.w.o: add http blackbox check [puppet] - https://gerrit.wikimedia.org/r/1291950
(https://phabricator.wikimedia.org/T425299) (owner: 10Tiziano Fogli) [10:02:23] (03PS1) 10Hnowlan: prometheus: add deployment label to appservers RED recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) [10:08:43] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1007.eqiad.wmnet [10:08:47] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1011.eqiad.wmnet [10:16:43] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1011.eqiad.wmnet [10:17:51] (03CR) 10Effie Mouzeli: "similar to Ic3837460ba88900b979ee846067a419f2d0a8061" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285341 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:18:04] (03PS5) 10Elukey: WIP: sre.hosts.provision: introduce the wmfroot user [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) [10:18:06] (03PS2) 10Effie Mouzeli: changeprop-jobqueue: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285341 (https://phabricator.wikimedia.org/T419976) [10:18:40] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:19:17] 10SRE-swift-storage, 06Commons: Commons file not found - https://phabricator.wikimedia.org/T427188 (10Jeff_G) 03NEW [10:22:30] (03PS2) 10Hnowlan: prometheus: add deployment label to appservers RED recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) [10:25:17] (03CR) 10Effie Mouzeli: [C:03+2] changeprop-jobqueue: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285341 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:26:24] 10SRE-swift-storage, 
Commons: Commons file not found - File:UCB Latin Extended-G.png - https://phabricator.wikimedia.org/T427188#11952023 (Peachey88)
[10:26:25] (PS1) Marostegui: pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1293084 (https://phabricator.wikimedia.org/T418973)
[10:27:44] (Merged) jenkins-bot: changeprop-jobqueue: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - https://gerrit.wikimedia.org/r/1285341 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[10:27:56] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:28:05] SRE-swift-storage, Commons: Commons file not found - File:UCB Latin Extended-G.png - https://phabricator.wikimedia.org/T427188#11952031 (Jeff_G) I restored original file description page.
[10:30:19] (CR) Marostegui: [C:+2] pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1293084 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[10:30:22] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[10:31:32] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:33:52] SRE-swift-storage, Commons: Commons file not found - File:UCB Latin Extended-G.png - https://phabricator.wikimedia.org/T427188#11952064 (Jeff_G) I could not find the filename in Verdy p's upload log, perhaps the file itself was never uploaded.
[10:34:04] (PS1) Marostegui: pc1023: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1293085 (https://phabricator.wikimedia.org/T418973)
[10:35:18] (CR) Marostegui: [C:+2] pc1023: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1293085 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[10:37:20] (PS1) Marostegui: instances.yaml: Add pc1023 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1293086 (https://phabricator.wikimedia.org/T418973)
[10:38:27] (CR) Marostegui: [C:+2] instances.yaml: Add pc1023 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1293086 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui)
[10:39:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[10:39:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T426633)', diff saved to https://phabricator.wikimedia.org/P92848 and previous config saved to /var/cache/conftool/dbconfig/20260525-103944-fceratto.json
[10:40:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc1023 to dbctl', diff saved to https://phabricator.wikimedia.org/P92849 and previous config saved to /var/cache/conftool/dbconfig/20260525-104027-marostegui.json
[10:40:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc1023 to pc3 as master T418973', diff saved to https://phabricator.wikimedia.org/P92850 and previous config saved to /var/cache/conftool/dbconfig/20260525-104055-marostegui.json
[10:41:00] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[10:41:00] (PS1) Blake: Update to kubernetes v1.31.14.
[debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1293087 (https://phabricator.wikimedia.org/T427065) [10:41:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc3 T418973', diff saved to https://phabricator.wikimedia.org/P92851 and previous config saved to /var/cache/conftool/dbconfig/20260525-104141-marostegui.json [10:43:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS trixie [10:44:00] (03PS1) 10Marostegui: pc1023: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1293088 [10:44:39] (03CR) 10Marostegui: [C:03+2] pc1023: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1293088 (owner: 10Marostegui) [10:46:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T426633)', diff saved to https://phabricator.wikimedia.org/P92852 and previous config saved to /var/cache/conftool/dbconfig/20260525-104625-fceratto.json [10:56:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P92853 and previous config saved to /var/cache/conftool/dbconfig/20260525-105633-fceratto.json [10:56:49] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [10:57:30] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:57:31] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [10:58:11] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:00:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2160.codfw.wmnet with reason: host reimage [11:05:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2160.codfw.wmnet with reason: host reimage [11:06:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to 
https://phabricator.wikimedia.org/P92854 and previous config saved to /var/cache/conftool/dbconfig/20260525-110640-fceratto.json [11:16:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T426633)', diff saved to https://phabricator.wikimedia.org/P92855 and previous config saved to /var/cache/conftool/dbconfig/20260525-111648-fceratto.json [11:17:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:17:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T426633)', diff saved to https://phabricator.wikimedia.org/P92856 and previous config saved to /var/cache/conftool/dbconfig/20260525-111717-fceratto.json [11:24:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T426633)', diff saved to https://phabricator.wikimedia.org/P92857 and previous config saved to /var/cache/conftool/dbconfig/20260525-112411-fceratto.json [11:24:16] 06SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Luvo - https://phabricator.wikimedia.org/T332214#11952181 (10LDlulisa-WMF) 05Resolved→03Open [11:26:01] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: retire mc1037-mc1054 [puppet] - 10https://gerrit.wikimedia.org/r/1289287 (https://phabricator.wikimedia.org/T426303) (owner: 10Effie Mouzeli) [11:26:08] (03PS1) 10Marostegui: installserver: Do not format pc1023 [puppet] - 10https://gerrit.wikimedia.org/r/1293092 (https://phabricator.wikimedia.org/T418973) [11:26:38] (03CR) 10Majavah: [C:03+2] gitlab: Fix type/key params for sshkey resource [puppet] - 10https://gerrit.wikimedia.org/r/1292190 (https://phabricator.wikimedia.org/T427094) (owner: 10Majavah) [11:27:38] (03PS3) 10Aklapper: offboard-user: Replace Conduit API user.query with user.search call [puppet] - 10https://gerrit.wikimedia.org/r/1292892 (https://phabricator.wikimedia.org/T420324) [11:27:58] (03CR) 10Aklapper: "Done, thank 
you again!" [puppet] - https://gerrit.wikimedia.org/r/1292892 (https://phabricator.wikimedia.org/T420324) (owner: Aklapper)
[11:28:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2160.codfw.wmnet with OS trixie
[11:28:15] SRE, SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Luvo - https://phabricator.wikimedia.org/T332214#11952195 (LDlulisa-WMF)
[11:28:32] SRE, SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Luvo - https://phabricator.wikimedia.org/T332214#11952197 (LDlulisa-WMF)
[11:34:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P92858 and previous config saved to /var/cache/conftool/dbconfig/20260525-113419-fceratto.json
[11:43:49] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1037.eqiad.wmnet
[11:44:06] (PS1) Mszwarc: Update plwikimedia logo to monochrome, following on-wiki change [mediawiki-config] - https://gerrit.wikimedia.org/r/1293094 (https://phabricator.wikimedia.org/T427193)
[11:44:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P92859 and previous config saved to /var/cache/conftool/dbconfig/20260525-114426-fceratto.json
[11:46:42] jouncebot: now
[11:46:42] No deployments scheduled for the next 1 hour(s) and 13 minute(s)
[11:46:47] jouncebot: next
[11:46:48] In 1 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1300)
[11:46:56] jiji@cumin1003 decommission (PID 4179256) is awaiting input
[11:47:40] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:54] (03CR) 10Marostegui: [C:03+2] installserver: Do not format pc1023 [puppet] - 10https://gerrit.wikimedia.org/r/1293092 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [11:48:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293094 (https://phabricator.wikimedia.org/T427193) (owner: 10Mszwarc) [11:50:40] jiji@cumin1003 decommission (PID 4179256) is awaiting input [11:52:38] (03PS1) 10Jon Harald Søby: Update logo, wordmark and tagline for zghwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290953 (https://phabricator.wikimedia.org/T426406) [11:52:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290953 (https://phabricator.wikimedia.org/T426406) (owner: 10Jon Harald Søby) [11:53:13] (03PS1) 10Effie Mouzeli: changeprop-jobqueue: codfw: replace rdb2007 with rdb2011 (Redis 8) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293095 (https://phabricator.wikimedia.org/T419976) [11:54:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T426633)', diff saved to https://phabricator.wikimedia.org/P92860 and previous config saved to /var/cache/conftool/dbconfig/20260525-115434-fceratto.json [11:54:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:55:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T426633)', diff saved to https://phabricator.wikimedia.org/P92861 and previous config saved to /var/cache/conftool/dbconfig/20260525-115504-fceratto.json [11:57:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:58] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [12:01:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T426633)', diff saved to https://phabricator.wikimedia.org/P92862 and previous config saved to /var/cache/conftool/dbconfig/20260525-120145-fceratto.json [12:02:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:40] jiji@cumin1003 decommission (PID 4179256) is awaiting input [12:10:48] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1037.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [12:11:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P92863 and previous config saved to /var/cache/conftool/dbconfig/20260525-121153-fceratto.json [12:13:53] jiji@cumin1003 decommission (PID 4179256) is awaiting input [12:14:38] (03PS1) 10Aklapper: Log AVA account disabling in the user account 
management feed [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1293101 (https://phabricator.wikimedia.org/T426972) [12:17:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1037.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [12:17:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1037.eqiad.wmnet [12:17:12] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1293102 (owner: 10L10n-bot) [12:19:02] (03CR) 10HakanIST: [C:03+1] "Verified the migration with `python-phabricator` against the Wikimedia Phabricator API. Both cases work as expected. LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1292892 (https://phabricator.wikimedia.org/T420324) (owner: 10Aklapper) [12:20:12] jiji@cumin1003 decommission (PID 29420) is awaiting input [12:22:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P92864 and previous config saved to /var/cache/conftool/dbconfig/20260525-122201-fceratto.json [12:23:06] (03CR) 10Mszwarc: Modify various configurations for English Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [12:24:16] (03PS1) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [12:27:57] Deploying cxserver.. 
[12:31:09] (03CR) 10Tiziano Fogli: prometheus: add deployment label to appservers RED recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [12:32:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T426633)', diff saved to https://phabricator.wikimedia.org/P92865 and previous config saved to /var/cache/conftool/dbconfig/20260525-123208-fceratto.json [12:32:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [12:32:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2192 (T426633)', diff saved to https://phabricator.wikimedia.org/P92866 and previous config saved to /var/cache/conftool/dbconfig/20260525-123239-fceratto.json [12:33:20] (03CR) 10VadymTS1: Modify various configurations for English Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [12:34:53] jiji@cumin1003 decommission (PID 29420) is awaiting input [12:36:06] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-05-24-103047-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292915 (https://phabricator.wikimedia.org/T426808) (owner: 10KartikMistry) [12:36:21] (03CR) 10Mszwarc: Modify various configurations for English Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [12:37:29] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1293107 (owner: 10L10n-bot) [12:38:12] (03Merged) 10jenkins-bot: Update cxserver to 2026-05-24-103047-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1292915 (https://phabricator.wikimedia.org/T426808) (owner: 10KartikMistry) [12:39:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T426633)', diff saved to https://phabricator.wikimedia.org/P92867 and previous config saved to /var/cache/conftool/dbconfig/20260525-123927-fceratto.json [12:39:43] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1038.eqiad.wmnet [12:39:53] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:40:15] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:43:25] jiji@cumin1003 decommission (PID 29420) is awaiting input [12:47:12] (03PS2) 10Sbisson: Article Guidance: enable experiment on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) [12:48:08] (03PS3) 10Sbisson: Article Guidance: enable experiment on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) [12:48:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [12:48:58] (03CR) 10Marostegui: "It looks good, but I'd put the following things on top to make it easier for future references:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [12:49:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to 
https://phabricator.wikimedia.org/P92868 and previous config saved to /var/cache/conftool/dbconfig/20260525-124934-fceratto.json [12:49:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Reboot [12:52:07] (03PS2) 10Tiziano Fogli: performance.w.o: restrict blackbox check to ip4 [puppet] - 10https://gerrit.wikimedia.org/r/1293091 (https://phabricator.wikimedia.org/T425299) [12:53:45] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:54:18] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:54:27] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1162: Reboot [12:54:35] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db1162: Reboot [12:55:53] (03CR) 10KartikMistry: [C:03+1] Article Guidance: enable experiment on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [12:56:12] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:56:44] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:58:52] !log Updated cxserver to 2026-05-24-103047-production (T426808, T373418) [12:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:58] T426808: Use node native sqlite in cxserver - https://phabricator.wikimedia.org/T426808 [12:58:58] T373418: Error response structure is not as documented - https://phabricator.wikimedia.org/T373418 [12:59:03] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1162: Reboot [12:59:10] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1162: Reboot [12:59:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P92870 and previous config 
saved to /var/cache/conftool/dbconfig/20260525-125942-fceratto.json [12:59:47] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1162: Reboot [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1300). [13:00:05] VadymTS1, Msz2001, Jhs, and stephanebisson: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:13] o/ [13:00:14] I can deploy [13:00:19] o/ [13:00:31] o/ [13:00:39] VadymTS1: I'll go with your patch first [13:00:46] okay [13:01:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [13:02:12] (03Merged) 10jenkins-bot: Modify various configurations for English Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1291966 (https://phabricator.wikimedia.org/T426992) (owner: 10VadymTS1) [13:02:36] 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11952459 (10phaultfinder) [13:03:29] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1291966|Modify various configurations for English Wikibooks (T426992)]] [13:03:33] T426992: Modify various configurations for English Wikibooks - https://phabricator.wikimedia.org/T426992 [13:06:18] (03PS1) 10Arnaudb: conftool-data: geodns: add gitlab-addrs [puppet] - 10https://gerrit.wikimedia.org/r/1290677 (https://phabricator.wikimedia.org/T425441) [13:07:01] (03PS1) 10Arnaudb: dns.admin: add gitlab-addrs resource [cookbooks] - 10https://gerrit.wikimedia.org/r/1290676 
(https://phabricator.wikimedia.org/T425441) [13:07:05] (03PS1) 10Arnaudb: conftool-data: add tcp-proxy gitlab service [puppet] - 10https://gerrit.wikimedia.org/r/1290729 (https://phabricator.wikimedia.org/T425441) [13:07:39] !log mszwarc@deploy1003 vadymts1, mszwarc: Backport for [[gerrit:1291966|Modify various configurations for English Wikibooks (T426992)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:48] checking [13:08:18] (03PS1) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [13:09:47] (03PS5) 10Arnaudb: service: add gitlab-https and gitlab-ssh service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) [13:09:47] (03CR) 10Arnaudb: "I tried to mimic the gerrit rollout → starting with everything in service_setup and then each new patch from the relation chain activates " [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:09:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T426633)', diff saved to https://phabricator.wikimedia.org/P92872 and previous config saved to /var/cache/conftool/dbconfig/20260525-130950-fceratto.json [13:10:03] (03PS6) 10Arnaudb: lvs7003: add gitlab-ssh and gitlab-https [puppet] - 10https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441) [13:10:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [13:10:23] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:10:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T426633)', diff saved to
https://phabricator.wikimedia.org/P92873 and previous config saved to /var/cache/conftool/dbconfig/20260525-131023-fceratto.json [13:10:26] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:12:28] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [13:12:42] alls good [13:12:47] !log mszwarc@deploy1003 vadymts1, mszwarc: Continuing with deployment [13:15:07] Once this is deployed, I'll deploy both logo patches (mine and Jhs's) – let's hope there won't be any merge conflicts, at least shouldn't be :) [13:15:59] (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293094 (https://phabricator.wikimedia.org/T427193) (owner: 10Mszwarc) [13:16:13] (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290953 (https://phabricator.wikimedia.org/T426406) (owner: 10Jon Harald Søby) [13:17:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T426633)', diff saved to https://phabricator.wikimedia.org/P92875 and previous config saved to /var/cache/conftool/dbconfig/20260525-131714-fceratto.json [13:18:12] jiji@cumin1003 decommission (PID 29420) is awaiting input [13:18:34] (03Merged) 10jenkins-bot: Update plwikimedia logo to monochrome, following on-wiki change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293094 (https://phabricator.wikimedia.org/T427193) (owner: 10Mszwarc) [13:18:37] (03Merged) 10jenkins-bot: Update logo, wordmark and tagline for zghwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290953 (https://phabricator.wikimedia.org/T426406) (owner: 10Jon Harald Søby) [13:19:22] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1291966|Modify various configurations for English Wikibooks (T426992)]] (duration: 15m 53s) [13:19:27] T426992: Modify various configurations for English Wikibooks 
- https://phabricator.wikimedia.org/T426992 [13:20:08] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1293094|Update plwikimedia logo to monochrome, following on-wiki change (T427193)]], [[gerrit:1290953|Update logo, wordmark and tagline for zghwiki (T426406)]] [13:20:14] T427193: Update plwikimedia logo to use black Wikimedia logo, instead of red-blue-green one - https://phabricator.wikimedia.org/T427193 [13:20:14] T426406: Update logo, wordmark and tagline for zghwiki - https://phabricator.wikimedia.org/T426406 [13:20:19] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1038.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [13:20:40] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:53] 06SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Luvo - https://phabricator.wikimedia.org/T332214#11952502 (10LDlulisa-WMF) [13:21:04] 06SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Luvo - https://phabricator.wikimedia.org/T332214#11952503 (10LDlulisa-WMF) 05Open→03Resolved [13:21:50] !log mszwarc@deploy1003 mszwarc, jhsoby: Backport for [[gerrit:1293094|Update plwikimedia logo to monochrome, following on-wiki change (T427193)]], [[gerrit:1290953|Update logo, wordmark and tagline for zghwiki (T426406)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:22:18] Jhs: please check :) [13:22:35] 06SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for Prabhat - https://phabricator.wikimedia.org/T332214#11952513 (10LDlulisa-WMF) [13:23:05] 06SRE, 10SRE-Access-Requests: Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197 (10LDlulisa-WMF) 03NEW [13:23:23] jiji@cumin1003 decommission (PID 29420) is awaiting input [13:23:30] Msz2001, LGTM! [13:23:37] !log mszwarc@deploy1003 mszwarc, jhsoby: Continuing with deployment [13:24:34] stephanebisson: Will you deploy yourself or should I deploy your patch as well? [13:24:41] I can do it [13:25:01] Okay [13:26:22] (03PS1) 10VadymTS1: Set $wgAutoconfirmCount to 25 on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293119 (https://phabricator.wikimedia.org/T427177) [13:27:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P92876 and previous config saved to /var/cache/conftool/dbconfig/20260525-132722-fceratto.json [13:27:46] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293094|Update plwikimedia logo to monochrome, following on-wiki change (T427193)]], [[gerrit:1290953|Update logo, wordmark and tagline for zghwiki (T426406)]] (duration: 07m 43s) [13:27:51] T427193: Update plwikimedia logo to use black Wikimedia logo, instead of red-blue-green one - https://phabricator.wikimedia.org/T427193 [13:27:52] T426406: Update logo, wordmark and tagline for zghwiki - https://phabricator.wikimedia.org/T426406 [13:27:55] stephanebisson: Over to you [13:28:04] On it, thanks [13:28:19] (purged the logos, in the meantime) [13:28:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [13:29:14] 06SRE, 10SRE-Access-Requests: Requesting Access to 
Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11952543 (10LDlulisa-WMF) [13:29:53] (03Merged) 10jenkins-bot: Article Guidance: enable experiment on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290813 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [13:30:10] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1290813|Article Guidance: enable experiment on phase 2 wikis (T426871)]] [13:30:15] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [13:31:15] VadymTS1: Would you like to have the plwiktionary patch deployed in this window? [13:31:36] If this possible yes [13:31:52] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1290813|Article Guidance: enable experiment on phase 2 wikis (T426871)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:32:08] I can do that after the current patch is done (stephanebisson please ping me once you're done) [13:32:21] will do [13:32:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293119 (https://phabricator.wikimedia.org/T427177) (owner: 10VadymTS1) [13:32:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1038.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [13:32:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:42] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1038.eqiad.wmnet [13:32:51] I added this change to this window [13:33:04] Deploying rec-api.. 
[13:33:05] ack [13:33:14] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:34:14] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:35:51] 06SRE, 10SRE-Access-Requests: Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11952556 (10LDlulisa-WMF) [13:36:47] jiji@cumin1003 decommission (PID 128988) is awaiting input [13:37:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P92878 and previous config saved to /var/cache/conftool/dbconfig/20260525-133729-fceratto.json [13:38:04] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqsin [13:38:19] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqsin [13:38:24] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290813|Article Guidance: enable experiment on phase 2 wikis (T426871)]] (duration: 08m 14s) [13:38:28] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [13:39:29] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad [13:39:42] Msz2001 back to you [13:39:47] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad [13:39:47] on it, thanks [13:40:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293119 (https://phabricator.wikimedia.org/T427177) (owner: 10VadymTS1) [13:40:36] (03CR) 10Ssingh: [C:03+2] ml-serve(grpc): step 1, etcd data for DNS Discovery [puppet] - 10https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:41:13] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [13:41:39] yeah it's down [13:41:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:04] (03Merged) 10jenkins-bot: Set $wgAutoconfirmCount to 25 on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293119 (https://phabricator.wikimedia.org/T427177) (owner: 10VadymTS1) [13:43:07] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 115660 bytes in 3.386 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [13:43:46] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1293119|Set $wgAutoconfirmCount to 25 on plwiktionary (T427177)]] [13:43:50] T427177: Set $wgAutoconfirmCount to 25 on plwiktionary - https://phabricator.wikimedia.org/T427177 [13:43:57] (03CR) 10Ssingh: [C:03+2] ml-serve(grpc): step 2, add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:45:13] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1162: Reboot [13:45:31] !log mszwarc@deploy1003 vadymts1, mszwarc: Backport for [[gerrit:1293119|Set $wgAutoconfirmCount to 25 on plwiktionary (T427177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:45:40] checking [13:46:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:04] (03PS4) 10Dpogorzelski: ml-serve(grpc): step 3, add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) [13:47:04] (03PS1) 10Dpogorzelski: ml-serve(grpc): step 4, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293120 (https://phabricator.wikimedia.org/T424049) [13:47:36] (03CR) 10Ssingh: [C:03+2] ml-serve(grpc): step 3, add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:47:37] (03CR) 10Ssingh: [V:03+2 C:03+2] ml-serve(grpc): step 3, add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:47:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T426633)', diff saved to https://phabricator.wikimedia.org/P92880 and previous config saved to /var/cache/conftool/dbconfig/20260525-134737-fceratto.json [13:47:41] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:47:46] (03CR) 10Ssingh: [V:03+2 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:47:48] All good [13:47:55] !log mszwarc@deploy1003 vadymts1, mszwarc: Continuing with deployment [13:48:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2223.codfw.wmnet with reason: Maintenance [13:48:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T426633)', diff saved to https://phabricator.wikimedia.org/P92881 and previous config saved to /var/cache/conftool/dbconfig/20260525-134807-fceratto.json [13:48:17] (03CR) 10Ssingh: [V:03+2 C:03+2] "Didn't mean to submit before CI finished but I do so in error. Running recheck again." [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:48:28] (03PS1) 10Dpogorzelski: ml-serve(grpc): step 5, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293121 (https://phabricator.wikimedia.org/T424049) [13:48:57] (03CR) 10CI reject: [V:04-1] ml-serve(grpc): step 5, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293121 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:49:00] !log Updated Recommendation API to 2026-05-21-044522-production [13:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:23] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5017.eqsin.wmnet [13:50:33] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5025.eqsin.wmnet [13:50:45] (03CR) 10Ssingh: [C:03+1] ml-serve(grpc): step 4, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293120 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:51:33] sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'disable-puppet "adding new ml-serve (grpc) T424049"' [13:51:33] T424049: k8s changes needed to 
allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049 [13:51:35] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1100.eqiad.wmnet [13:51:42] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1101.eqiad.wmnet [13:52:03] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293119|Set $wgAutoconfirmCount to 25 on plwiktionary (T427177)]] (duration: 09m 43s) [13:52:08] T427177: Set $wgAutoconfirmCount to 25 on plwiktionary - https://phabricator.wikimedia.org/T427177 [13:52:10] (03CR) 10Ssingh: [C:03+2] ml-serve(grpc): step 4, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293120 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:52:19] thanks [13:52:27] !log Everything deployed, UTC afternoon config+backport window done [13:52:28] (03PS2) 10Dpogorzelski: ml-serve(grpc): step 4, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293120 (https://phabricator.wikimedia.org/T424049) [13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:36] yw and thanks for the patches :) [13:53:17] (03CR) 10Ssingh: [C:03+2] ml-serve(grpc): step 4, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293120 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:54:02] (03PS2) 10Dpogorzelski: ml-serve(grpc): step 5, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293121 (https://phabricator.wikimedia.org/T424049) [13:54:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T426633)', diff saved to https://phabricator.wikimedia.org/P92882 and previous config saved to /var/cache/conftool/dbconfig/20260525-135458-fceratto.json [13:56:13] (03CR) 10Marostegui: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 
(https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [13:57:47] sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049": NOOP change, since service is codfw only [13:57:48] T424049: k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049 [13:57:57] I forgot a log, didn't I [13:57:59] !log sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049": NOOP change, since service is codfw only [13:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:22] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1039.eqiad.wmnet [14:00:16] !log sudo cumin 'A:lvs and A:lvs-secondary-codfw' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049"' [14:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:05] !log sukhe@lvs2014:~$ sudo systemctl restart pybal.service [14:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:18] !log sukhe@lvs2014:~$ sudo systemctl restart pybal.service": T424049 [14:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:58] jiji@cumin1003 decommission (PID 128988) is awaiting input [14:03:54] !log sudo cumin 'A:lvs and A:lvs-low-traffic-codfw' 'run-puppet-agent --enable "adding new ml-serve (grpc) T424049"' [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] T424049: k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049 [14:05:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P92884 and previous config saved to
/var/cache/conftool/dbconfig/20260525-140505-fceratto.json [14:05:11] !log sukhe@lvs2013:~$ sudo systemctl restart pybal.service: T424049 [14:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:15] !log curl localhost:9090/pools/inference-staging-grpc_30051 shows ml-staging200[1-3].codfw.wmnet as enabled and pooled: T424049 [14:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:47] (03CR) 10Ssingh: [C:03+2] ml-serve(grpc): step 5, change lvs state [puppet] - 10https://gerrit.wikimedia.org/r/1293121 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [14:10:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:19] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [14:15:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P92885 and previous config saved to /var/cache/conftool/dbconfig/20260525-141513-fceratto.json [14:15:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: 10HakanIST) [14:16:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:07] jiji@cumin1003 decommission (PID 128988) is awaiting input [14:21:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 
has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T426633)', diff saved to https://phabricator.wikimedia.org/P92887 and previous config saved to /var/cache/conftool/dbconfig/20260525-142520-fceratto.json [14:25:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2228.codfw.wmnet with reason: Maintenance [14:25:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T426633)', diff saved to https://phabricator.wikimedia.org/P92888 and previous config saved to /var/cache/conftool/dbconfig/20260525-142551-fceratto.json [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1430) [14:31:37] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1103.eqiad.wmnet [14:32:37] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5018.eqsin.wmnet [14:32:46] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5026.eqsin.wmnet [14:32:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T426633)', diff saved to https://phabricator.wikimedia.org/P92889 and previous config saved to /var/cache/conftool/dbconfig/20260525-143246-fceratto.json [14:33:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:33:14] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1102.eqiad.wmnet [14:41:03] (03PS1) 10Herron: grafana: 
increase render workers [puppet] - 10https://gerrit.wikimedia.org/r/1293131 [14:42:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P92890 and previous config saved to /var/cache/conftool/dbconfig/20260525-144253-fceratto.json [14:48:42] PROBLEM - ganeti-noded running on ganeti1058 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:49:42] RECOVERY - ganeti-noded running on ganeti1058 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:53:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P92891 and previous config saved to /var/cache/conftool/dbconfig/20260525-145301-fceratto.json [14:56:53] (03PS1) 10Herron: grafana: set grid as default report layout [puppet] - 10https://gerrit.wikimedia.org/r/1293133 [14:57:41] (03CR) 10Tiziano Fogli: [C:03+1] grafana: set grid as default report layout [puppet] - 10https://gerrit.wikimedia.org/r/1293133 (owner: 10Herron) [14:57:56] (03CR) 10Herron: [C:03+2] grafana: set grid as default report layout [puppet] - 10https://gerrit.wikimedia.org/r/1293133 (owner: 10Herron) [15:03:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T426633)', diff saved to https://phabricator.wikimedia.org/P92892 and previous config saved to /var/cache/conftool/dbconfig/20260525-150309-fceratto.json [15:08:17] (03PS1) 10Filippo Giunchedi: pontoon: fix sssd_filter_users / sssd_filter_groups [puppet] - 10https://gerrit.wikimedia.org/r/1293135 [15:09:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:29] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:52] (03CR) 10Ssingh: [C:03+1] "🚢 it" [puppet] - 10https://gerrit.wikimedia.org/r/1289997 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:10:54] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix sssd_filter_users / sssd_filter_groups [puppet] - 10https://gerrit.wikimedia.org/r/1293135 (owner: 10Filippo Giunchedi) [15:11:26] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1105.eqiad.wmnet [15:12:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:59] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1104.eqiad.wmnet [15:14:14] (03PS1) 10Marostegui: instances.yaml: Remove pc1013 [puppet] - 10https://gerrit.wikimedia.org/r/1293136 (https://phabricator.wikimedia.org/T427190) [15:14:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:00] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5027.eqsin.wmnet [15:15:01] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5019.eqsin.wmnet [15:15:27] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove pc1013 [puppet] - 
10https://gerrit.wikimedia.org/r/1293136 (https://phabricator.wikimedia.org/T427190) (owner: 10Marostegui) [15:17:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc1013 from dbctl T427190', diff saved to https://phabricator.wikimedia.org/P92893 and previous config saved to /var/cache/conftool/dbconfig/20260525-151718-marostegui.json [15:17:23] T427190: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190 [15:19:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:25] RESOLVED: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:49] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1039.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [15:29:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1039.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [15:29:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1039.eqiad.wmnet [15:29:28] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1040.eqiad.wmnet [15:30:05] jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us 
again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1530). [15:33:03] jiji@cumin1003 decommission (PID 275190) is awaiting input [15:33:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:41:45] (03PS1) 10Sbisson: Enable AG experiment on phase 2 batch 2 wikis: ar, bn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293139 (https://phabricator.wikimedia.org/T426871) [15:51:10] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1107.eqiad.wmnet [15:52:45] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1106.eqiad.wmnet [15:57:23] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5028.eqsin.wmnet [15:57:28] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5020.eqsin.wmnet [15:59:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2249.codfw.wmnet with reason: Maintenance [15:59:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2249 (T426633)', diff saved to https://phabricator.wikimedia.org/P92894 and previous config saved to /var/cache/conftool/dbconfig/20260525-155930-fceratto.json [16:02:10] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [16:02:39] (03PS2) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [16:04:40] (03CR) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [16:04:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2249 (T426633)', diff saved to 
https://phabricator.wikimedia.org/P92895 and previous config saved to /var/cache/conftool/dbconfig/20260525-160450-fceratto.json [16:07:49] jiji@cumin1003 decommission (PID 275190) is awaiting input [16:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:29] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:14] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2249', diff saved to https://phabricator.wikimedia.org/P92896 and previous config saved to /var/cache/conftool/dbconfig/20260525-161457-fceratto.json [16:16:35] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1040.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:17:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:40] jiji@cumin1003 decommission (PID 275190) is awaiting input [16:20:20] !log jiji@cumin1003 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1040.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:20:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1040.eqiad.wmnet [16:20:35] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1041.eqiad.wmnet [16:25:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2249', diff saved to https://phabricator.wikimedia.org/P92897 and previous config saved to /var/cache/conftool/dbconfig/20260525-162505-fceratto.json [16:26:36] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [16:30:59] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1109.eqiad.wmnet [16:32:16] jiji@cumin1003 decommission (PID 307288) is awaiting input [16:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:31] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1108.eqiad.wmnet [16:35:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2249 (T426633)', diff saved to https://phabricator.wikimedia.org/P92898 and previous config saved to /var/cache/conftool/dbconfig/20260525-163512-fceratto.json [16:35:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:36:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T426633)', diff saved to https://phabricator.wikimedia.org/P92899 and previous config saved to 
/var/cache/conftool/dbconfig/20260525-163559-fceratto.json [16:39:29] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5029.eqsin.wmnet [16:40:08] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5021.eqsin.wmnet [16:41:18] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1041.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:42:16] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1041.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:42:16] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:17] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1041.eqiad.wmnet [16:42:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T426633)', diff saved to https://phabricator.wikimedia.org/P92900 and previous config saved to /var/cache/conftool/dbconfig/20260525-164247-fceratto.json [16:44:36] (03PS1) 10Hnowlan: tests/integration: readability improvements [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1293147 (https://phabricator.wikimedia.org/T385798) [16:45:19] jiji@cumin1003 decommission (PID 321768) is awaiting input [16:51:14] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1042.eqiad.wmnet [16:52:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P92901 and previous config saved to /var/cache/conftool/dbconfig/20260525-165255-fceratto.json [16:54:51] jiji@cumin1003 decommission (PID 321768) is awaiting input [16:55:14] (03PS2) 10Hnowlan: tests/integration: readability improvements [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/1293147 (https://phabricator.wikimedia.org/T385798) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1700) [17:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T1700). [17:03:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P92902 and previous config saved to /var/cache/conftool/dbconfig/20260525-170302-fceratto.json [17:04:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:05:29] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:06:39] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [17:11:02] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1042.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [17:13:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T426633)', diff saved to https://phabricator.wikimedia.org/P92903 and previous config saved to /var/cache/conftool/dbconfig/20260525-171310-fceratto.json [17:14:07] jiji@cumin1003 decommission (PID 321768) is awaiting input [17:18:50] sukhe@cumin1003 roll-reboot (PID 137856) is awaiting input [17:28:36] !log sukhe@alert1002:~$ sudo systemctl 
restart icinga.service [17:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:46] PROBLEM - Host cp1110 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:49] RECOVERY - Host cp1110 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [17:29:22] around, looking [17:29:31] see -sre-private [17:29:32] moritzm: should be recovering now, I restarted it [17:29:42] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1111.eqiad.wmnet [17:29:44] ah, ok! [17:30:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1042.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [17:30:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1042.eqiad.wmnet [17:31:08] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1110.eqiad.wmnet [reason: manually pooling after reboot as icinga was down] [17:35:17] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1043.eqiad.wmnet [17:38:53] jiji@cumin1003 decommission (PID 327972) is awaiting input [18:00:51] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-text_eqsin [18:01:07] !log sre.cdn.roll-reboot cookbooks stalled due to icinga reboot [18:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:11] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-upload_eqsin [18:01:51] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-text_eqiad [18:02:42] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp5023*} and A:cp [18:03:15] !log sukhe@cumin1003 START - Cookbook
sre.cdn.roll-reboot rolling reboot on P{cp1113*} and A:cp [18:09:22] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1113.eqiad.wmnet [18:09:22] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1113.eqiad.wmnet [18:09:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp1113*} and A:cp [18:10:25] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp5030*} and A:cp [18:10:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:49] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [18:15:07] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5023.eqsin.wmnet [18:15:07] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp5023*} and A:cp [18:16:29] jiji@cumin1003 decommission (PID 327972) is awaiting input [18:22:39] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5030.eqsin.wmnet [18:22:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp5030*} and A:cp [18:33:59] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5030.eqsin.wmnet [reason: manually pooling after reboot as icinga was down] [18:34:06] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet [reason: manually pooling after reboot as icinga was down] [18:49:18] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1115.eqiad.wmnet [18:49:18] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqiad [19:22:12] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1043.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [19:25:17] jiji@cumin1003 decommission (PID 327972) is awaiting input [19:25:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1043.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [19:25:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:25:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1043.eqiad.wmnet [19:28:36] jiji@cumin1003 decommission (PID 341799) is awaiting input [19:42:26] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:48:34] FIRING: DiskSpace: Disk space krb1002:9100:/ 0.9009% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:55:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:04] jiji@cumin1003 decommission (PID 341799) is awaiting input [19:57:19] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1044.eqiad.wmnet [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T2000) [20:00:05] HakanIST: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] hi [20:00:55] jiji@cumin1003 decommission (PID 341799) is awaiting input [20:06:50] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [20:08:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:12:14] (03PS1) 10Alex.sanford: Enforce 2FA requirements for phase 3 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) [20:12:28] jiji@cumin1003 decommission (PID 341799) is awaiting input [20:14:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [20:14:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293161 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [20:15:07] !log truncate krb5kdc.log1 (which made log rotation fail) [20:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:40] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1044.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" 
[20:28:44] jiji@cumin1003 decommission (PID 341799) is awaiting input [20:35:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1044.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [20:37:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:37:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1044.eqiad.wmnet [20:38:18] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1045.eqiad.wmnet [20:49:20] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [20:54:59] jiji@cumin1003 decommission (PID 348610) is awaiting input [20:55:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:56:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:00:04] alexsanford, Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T2100). 
[21:00:43] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1045.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [21:03:47] jiji@cumin1003 decommission (PID 348610) is awaiting input [21:06:40] (03PS1) 10Kosta Harlan: hCaptcha: Complete rollout to all wikis (group2 + cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293167 (https://phabricator.wikimedia.org/T425354) [21:07:27] (03CR) 10CI reject: [V:04-1] hCaptcha: Complete rollout to all wikis (group2 + cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293167 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [21:12:27] (03PS2) 10Kosta Harlan: hCaptcha: Complete rollout to all wikis (group2 + cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293167 (https://phabricator.wikimedia.org/T425354) [21:12:27] (03PS3) 10Kosta Harlan: hCaptcha: Exempt CommunityRequests pages from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) [22:10:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:58] (03PS2) 10Aklapper: Log AVA account disabling in the user account management feed [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1293101 (https://phabricator.wikimedia.org/T426972) [23:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260525T2300) [23:05:04] (03PS1) 10Bartosz Dziewoński: Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) [23:39:27] (03PS1) 
10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1293176 [23:39:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1293176 (owner: 10TrainBranchBot) [23:50:27] (03PS1) 10Sbisson: Instrumentation: log new articles namespace and source [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293177 (https://phabricator.wikimedia.org/T422146) [23:51:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1293177 (https://phabricator.wikimedia.org/T422146) (owner: 10Sbisson) [23:51:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1293176 (owner: 10TrainBranchBot)