[00:30:00] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[00:31:47] (PS1) Novem Linguae: InitialiseSettings: change enwiki extendedconfirmed autopromote settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1307270 (https://phabricator.wikimedia.org/T431060)
[00:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.49% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:56:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:57:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:12:23] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1307271
[01:12:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1307271 (owner: TrainBranchBot)
[01:20:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:18] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1307271 (owner: TrainBranchBot)
[02:00:32] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:51] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 07m 18s)
[02:09:43] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:43] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:04] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:39:55] (CR) Cwhite: [C:+1] "Looks like it should do the trick! Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1307087 (https://phabricator.wikimedia.org/T430149) (owner: Hnowlan)
[03:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:39] (CR) Cwhite: "PCC looks good: https://puppet-compiler.wmflabs.org/output/1305718/8916/" [puppet] - https://gerrit.wikimedia.org/r/1305718 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[03:45:08] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1306969/8917/" [puppet] - https://gerrit.wikimedia.org/r/1306969 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[03:47:27] (PS4) Cwhite: logstash: add parameters needed for security plugin [puppet] - https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516)
[03:54:07] (PS5) Cwhite: logstash: add parameters needed for security plugin [puppet] - https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516)
[03:55:31] (PS6) Cwhite: logstash: add parameters needed for security plugin [puppet] - https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516)
[03:57:20] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1305769/8921/" [puppet] - https://gerrit.wikimedia.org/r/1305769 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[04:00:14] (PS2) Cwhite: logstash: configure provisioning of admin certificate [puppet] - https://gerrit.wikimedia.org/r/1306001 (https://phabricator.wikimedia.org/T350516)
[04:01:48] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1306001/8923/" [puppet] - https://gerrit.wikimedia.org/r/1306001 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[04:03:12] (CR) Cwhite: [C:+1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/1307176 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[04:32:56] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1016 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:33:56] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1016 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:45:16] ops-eqiad, SRE, DBA, DC-Ops: db1228 crashed - https://phabricator.wikimedia.org/T430934#12083998 (Marostegui) Thanks - I will take it from here!
[04:56:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:57:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:20:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260703T0600)
[06:11:25] (CR) Gkyziridis: [C:+2] ml-services: Qwen36-27b test CUDA graphs on 1013/1014 with raised timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1307105 (owner: Gkyziridis)
[06:13:33] (Merged) jenkins-bot: ml-services: Qwen36-27b test CUDA graphs on 1013/1014 with raised timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1307105 (owner: Gkyziridis)
[06:15:58] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[06:28:25] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12084061 (catherine.kelsey.wmde) Hey @Dzahn - we were told it was a valid group - see this thread: https://wikimedia.slack.com/archives/CSV483812/p1782995115560309?thread_...
[06:32:31] (CR) Arnaudb: [C:+2] taskgen: allow profile_yaml to render templates [puppet] - https://gerrit.wikimedia.org/r/1306481 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb)
[06:33:43] (CR) Muehlenhoff: [C:+2] Remove alerts for the mirror lag [alerts] - https://gerrit.wikimedia.org/r/1307176 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[06:42:18] SRE, Infrastructure-Foundations: Make airflow-wmde-ops managed in Bitu - https://phabricator.wikimedia.org/T431077 (MoritzMuehlenhoff) NEW
[06:44:34] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12084084 (MoritzMuehlenhoff) >>! In T430961#12083751, @Dzahn wrote: > Hi, I'm afraid `airflow-wmde-ops` is not a valid group. It is? cn=airflow-wmde-ops has twelve member...
[06:52:04] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260703T0700)
[07:04:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:09:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:27:24] (CR) Muehlenhoff: [C:+1] "Looks good (I had a look at what uses the mediawiki::cgroup class in Cumin and the only remaining user outside of the deployment servers i" [puppet] - https://gerrit.wikimedia.org/r/1307205 (https://phabricator.wikimedia.org/T423714) (owner: Kamila Součková)
[07:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:53] !log jmm@cumin2003 START - Cookbook sre.hosts.decommission for hosts mirror1001.wikimedia.org
[08:16:34] (PS1) Filippo Giunchedi: dumps-nfs: re-use dumps.w.o address [puppet] - https://gerrit.wikimedia.org/r/1307338 (https://phabricator.wikimedia.org/T411248)
[08:17:01] (CR) Filippo Giunchedi: "Publishing today though will merge next week" [puppet] - https://gerrit.wikimedia.org/r/1307338 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[08:18:14] !log jmm@cumin2003 START - Cookbook sre.dns.netbox
[08:18:36] (PS1) Filippo Giunchedi: wikimedia.org: move dumps-nfs to dumps-lb [dns] - https://gerrit.wikimedia.org/r/1307339 (https://phabricator.wikimedia.org/T411248)
[08:18:58] (CR) Filippo Giunchedi: "Publishing today though will merge next week" [dns] - https://gerrit.wikimedia.org/r/1307339 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[08:20:43] (CR) Majavah: "i wonder if for NFS we should just use the per-DC `dumps-lb.eqiad` address directly, instead of a CNAME like this?" [dns] - https://gerrit.wikimedia.org/r/1307339 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[08:20:57] (CR) Majavah: [C:+1] dumps-nfs: re-use dumps.w.o address [puppet] - https://gerrit.wikimedia.org/r/1307338 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[08:24:18] jmm@cumin2003 decommission (PID 2587110) is awaiting input
[08:29:30] !log jmm@cumin2003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mirror1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2003"
[08:31:52] !log jmm@cumin2003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mirror1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2003"
[08:31:53] !log jmm@cumin2003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:31:54] !log jmm@cumin2003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mirror1001.wikimedia.org
[08:32:11] SRE, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Radar), User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12084350 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2003 for hosts: `mirror1001.wikime...
[08:32:42] SRE, Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12084351 (MoritzMuehlenhoff)
[08:34:23] SRE, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Radar), User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12084376 (MoritzMuehlenhoff)
[08:34:52] (PS2) Hashar: cache: turn off caching for zuul.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1306353 (https://phabricator.wikimedia.org/T430462) (owner: Dzahn)
[08:40:58] (PS1) Majavah: P:openstack: designate: Remove absented mcrouter resources [puppet] - https://gerrit.wikimedia.org/r/1307344 (https://phabricator.wikimedia.org/T427189)
[08:41:04] (PS1) Muehlenhoff: Remove mirror1001 from site.pp/preseed config [puppet] - https://gerrit.wikimedia.org/r/1307345 (https://phabricator.wikimedia.org/T431088)
[08:43:43] (CR) Filippo Giunchedi: "Followup from irc: if/when dumps gets multi-site we can change the record" [dns] - https://gerrit.wikimedia.org/r/1307339 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[08:49:03] !log depooling cirrussearch in codfw because of regression after upgrade T431091
[08:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:06] T431091: Regression in search with OpenSearch 2 - https://phabricator.wikimedia.org/T431091
[08:49:51] !log atsuko@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search,name=codfw
[08:50:00] !log atsuko@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-psi,name=codfw
[08:50:08] !log atsuko@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=search-omega,name=codfw
[08:55:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply
[08:56:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:56:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:56:51] (CR) Dreamy Jazz: "Hmm the formatting of my list didn't work :D" [dumps] - https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: Dreamy Jazz)
[08:57:20] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:57:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:59:12] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:00:50] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:04:06] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:04:19] (CR) Majavah: [C:+1] wikimedia.org: move dumps-nfs to dumps-lb [dns] - https://gerrit.wikimedia.org/r/1307339 (https://phabricator.wikimedia.org/T411248) (owner: Filippo Giunchedi)
[09:04:54] (CR) Dreamy Jazz: Filter change_tag and change_tag_def dumps (1 comment) [dumps] - https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: Dreamy Jazz)
[09:05:02] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:09:51] (CR) Dreamy Jazz: Filter change_tag and change_tag_def dumps (1 comment) [dumps] - https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: Dreamy Jazz)
[09:16:54] !log cwilliams@cumin1003 START - Cookbook sre.mysql.multiinstance_reboot for db-test[2001-2002].codfw.wmnet
[09:17:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply
[09:19:16] (PS10) Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180)
[09:20:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:26] (PS2) Giuseppe Lavagetto: hiddenparma: add known fingerprints file [puppet] - https://gerrit.wikimedia.org/r/1307013
[09:21:36] SRE, DBA: db1228 crashed - https://phabricator.wikimedia.org/T430934#12084620 (Marostegui) a:VRiley-WMF→Marostegui
[09:24:09] (PS1) Marostegui: db1228: Add note [puppet] - https://gerrit.wikimedia.org/r/1307357 (https://phabricator.wikimedia.org/T430934)
[09:25:10] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.multiinstance_reboot (exit_code=99) for db-test[2001-2002].codfw.wmnet
[09:26:39] (CR) Marostegui: [C:+2] db1228: Add note [puppet] - https://gerrit.wikimedia.org/r/1307357 (https://phabricator.wikimedia.org/T430934) (owner: Marostegui)
[09:27:24] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12084645 (lado85) Not resolved. Peoblem still exist for some cases in ruwiki. Timeline is broken by small cyrillic letter х
[09:29:28] (PS1) Blake: mw-pretrain: remove nodePort service. [deployment-charts] - https://gerrit.wikimedia.org/r/1307358
[09:29:43] (PS1) Jcrespo: mediabackups: Remove insetup hosts backup1004-backup1007 [puppet] - https://gerrit.wikimedia.org/r/1307359 (https://phabricator.wikimedia.org/T431098)
[09:29:45] (PS1) Jcrespo: mediabackups: Remove insetup hosts backup2004-backup2007 [puppet] - https://gerrit.wikimedia.org/r/1307360 (https://phabricator.wikimedia.org/T431098)
[09:29:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:30:05] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12084658 (AndrewTavis_WMDE) Looking into this a bit, is it that @catherine.kelsey.wmde needs to be added to `analytics-wmde-users`? This at the very least should be done g...
[09:30:37] (CR) CI reject: [V:-1] mediabackups: Remove insetup hosts backup2004-backup2007 [puppet] - https://gerrit.wikimedia.org/r/1307360 (https://phabricator.wikimedia.org/T431098) (owner: Jcrespo)
[09:32:03] (PS2) Jcrespo: mediabackups: Remove insetup hosts backup2004-backup2007 [puppet] - https://gerrit.wikimedia.org/r/1307360 (https://phabricator.wikimedia.org/T431098)
[09:34:00] (PS2) Jcrespo: mediabackups: Remove insetup hosts backup1004-backup1007 [puppet] - https://gerrit.wikimedia.org/r/1307359 (https://phabricator.wikimedia.org/T431097)
[09:34:16] (PS3) Jcrespo: mediabackups: Remove insetup hosts backup2004-backup2007 [puppet] - https://gerrit.wikimedia.org/r/1307360 (https://phabricator.wikimedia.org/T431098)
[09:35:17] (CR) Elukey: "All Jesse's suggestions should be taken care of, I'll re-test the cookbook with the sretest hosts to confirm that everything works as expe" [cookbooks] - https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[09:36:29] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:36:37] SRE, DBA: db1228 crashed - https://phabricator.wikimedia.org/T430934#12084680 (Marostegui) p:High→Medium
[09:36:47] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts backup[1004-1007].eqiad.wmnet
[09:37:08] SRE, DBA: db1228 crashed - https://phabricator.wikimedia.org/T430934#12084686 (Marostegui)
[09:37:37] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:38:39] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:39:10] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:42:47] (PS1) Majavah: hieradata: Remove unused openstack endpoint keys [puppet] - https://gerrit.wikimedia.org/r/1307363
[09:46:48] SRE-SLO, observability, Wikidata, Wikidata Platform Team, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#12084742 (Gehel)
[09:52:02] !log jynus@cumin1003 START - Cookbook sre.dns.netbox
[09:53:10] SRE, DC-Ops, Data-Platform-SRE (2026-07-03 - 2026-07-31): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#12084869 (Gehel)
[09:56:39] SRE, Data-Platform-SRE (2026-07-03 - 2026-07-31): archiva1002 has stale jobs in /var/cache/archiva that uses all the disk space - https://phabricator.wikimedia.org/T425083#12084961 (Gehel)
[09:57:41] sre-alert-triage, Data-Platform-SRE (2026-07-03 - 2026-07-31): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1208) - https://phabricator.wikimedia.org/T430138#12084992 (Gehel)
[09:57:51] sre-alert-triage, Data-Platform-SRE (2026-07-03 - 2026-07-31): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T430139#12084996 (Gehel)
[09:58:08] jynus@cumin1003 decommission (PID 1369876) is awaiting input
[09:58:46] sre-alert-triage, Data-Platform-SRE (2026-07-03 - 2026-07-31): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#12085023 (Gehel)
[09:58:56] sre-alert-triage, Data-Platform-SRE (2026-07-03 - 2026-07-31): Alert in need of triage: PuppetFailure (instance an-test-client1002:9100) - https://phabricator.wikimedia.org/T427399#12085027 (Gehel)
[10:01:42] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup[1004-1007].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003"
[10:01:56] (PS2) Muehlenhoff: Remove mirror1001 from site.pp/preseed config [puppet] - https://gerrit.wikimedia.org/r/1307345 (https://phabricator.wikimedia.org/T431088)
[10:02:01] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup[1004-1007].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003"
[10:02:01] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:02:02] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup[1004-1007].eqiad.wmnet
[10:02:38] SRE, Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12085067 (MoritzMuehlenhoff)
[10:04:37] (CR) Muehlenhoff: [C:+2] Remove mirror1001 from site.pp/preseed config [puppet] - https://gerrit.wikimedia.org/r/1307345 (https://phabricator.wikimedia.org/T431088) (owner: Muehlenhoff)
[10:07:41] (PS1) Btullis: datahub-next: disable atomic deploys for the bring-up [deployment-charts] - https://gerrit.wikimedia.org/r/1307367 (https://phabricator.wikimedia.org/T402408)
[10:11:27] (PS1) Dpogorzelski: ml-serve: bigger kubelet LV for ml-serve [puppet] - https://gerrit.wikimedia.org/r/1307368 (https://phabricator.wikimedia.org/T431017)
[10:19:09] SRE, SRE-Access-Requests: Superset data access request for abibendall - https://phabricator.wikimedia.org/T430938#12085114 (cmooney) Hi @ABendall-WMF, thanks for the request. Access to the 'wmf' group should now be something you can add via our IDP platform at https://idm.wikimedia.org. Please see inst...
[10:19:20] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply
[10:24:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:37:03] (PS1) Muehlenhoff: Remove the acmechief config for mirrors.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1307378 (https://phabricator.wikimedia.org/T416707)
[10:37:07] (CR) Btullis: [C:+2] datahub-next: disable atomic deploys for the bring-up [deployment-charts] - https://gerrit.wikimedia.org/r/1307367 (https://phabricator.wikimedia.org/T402408) (owner: Btullis)
[10:37:24] (PS1) Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures [dumps] - https://gerrit.wikimedia.org/r/1307379
[10:37:59] (PS2) Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures [dumps] - https://gerrit.wikimedia.org/r/1307379
[10:38:00] (PS1) Btullis: datahub: remove the obsolete no-code migration job [deployment-charts] - https://gerrit.wikimedia.org/r/1307380 (https://phabricator.wikimedia.org/T402408)
[10:39:24] (Merged) jenkins-bot: datahub-next: disable atomic deploys for the bring-up [deployment-charts] - https://gerrit.wikimedia.org/r/1307367 (https://phabricator.wikimedia.org/T402408) (owner: Btullis)
[10:39:36] (CR) CI reject: [V:-1] Enable unit tests in CI and fix pre-existing test failures [dumps] - https://gerrit.wikimedia.org/r/1307379 (owner: Kosta Harlan)
[10:40:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply
[10:41:48] (PS1) Muehlenhoff: Remove role:mirror and related Puppet classes [puppet] - https://gerrit.wikimedia.org/r/1307383 (https://phabricator.wikimedia.org/T416707)
[10:44:35] SRE, SRE-Access-Requests: Requesting access to cloudcontrol servers for tlepage - https://phabricator.wikimedia.org/T431010#12085228 (cmooney)
[10:45:11] (CR) Hnowlan: [C:+1] logstash: configure provisioning of admin certificate [puppet] - https://gerrit.wikimedia.org/r/1306001 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite)
[10:46:12] (PS3) Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures [dumps] - https://gerrit.wikimedia.org/r/1307379
[10:52:11] SRE, Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12085237 (MoritzMuehlenhoff)
[10:52:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:55:00] SRE, SRE-Access-Requests: Requesting access to cloudcontrol servers for tlepage - https://phabricator.wikimedia.org/T431010#12085250 (cmooney)
[10:58:38] ops-eqiad, DC-Ops, decommission-hardware: decommission mirror1001 - https://phabricator.wikimedia.org/T431088#12085263 (MoritzMuehlenhoff)
[10:58:48] ops-eqiad, DC-Ops, decommission-hardware: decommission mirror1001 - https://phabricator.wikimedia.org/T431088#12085265 (MoritzMuehlenhoff)
[10:59:05] (PS9) Kosta Harlan: Filter change_tag and change_tag_def dumps [dumps] - https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: Dreamy Jazz)
[11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260703T0700)
[11:00:04] jelto, arnoldokoth, mutante, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260703T1100).
[11:00:37] (CR) CI reject: [V:-1] Filter change_tag and change_tag_def dumps [dumps] - https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: Dreamy Jazz)
[11:00:50] (PS1) Muehlenhoff: Remove Cumin alias for mirrors [puppet] - https://gerrit.wikimedia.org/r/1307395 (https://phabricator.wikimedia.org/T416707)
[11:02:06] (PS1) Cathal Mooney: dse-k8s: add new POD IPv4 block 10.68.0.0/17 to prefix lists [homer/public] - https://gerrit.wikimedia.org/r/1307396 (https://phabricator.wikimedia.org/T430658)
[11:03:45] (PS1) Kosta Harlan: WIP: Test fixes and unquoting [dumps] - https://gerrit.wikimedia.org/r/1307397
[11:05:39] (CR) Muehlenhoff: [C:+2] Remove Cumin alias for mirrors [puppet] - https://gerrit.wikimedia.org/r/1307395 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff)
[11:09:33] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085280 (MoritzMuehlenhoff) >>! In T430961#12084658, @AndrewTavis_WMDE wrote: > Looking into this a bit, is it that @catherine.kelsey.wmde needs to be added to `analytics...
[11:10:54] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085281 (MoritzMuehlenhoff) >>! In T430961#12085280, @MoritzMuehlenhoff wrote: >>>! In T430961#12084658, @AndrewTavis_WMDE wrote: >> Looking into this a bit, is it that @...
[11:12:20] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085286 (catherine.kelsey.wmde) Hi @Lena_WMDE - please could you approve this? It's something that's required for me to be able to test/run DAGs. Thanks!
[11:18:31] (CR) Btullis: [C:+2] datahub: remove the obsolete no-code migration job [deployment-charts] - https://gerrit.wikimedia.org/r/1307380 (https://phabricator.wikimedia.org/T402408) (owner: Btullis)
[11:19:15] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1307403
[11:20:48] (Merged) jenkins-bot: datahub: remove the obsolete no-code migration job [deployment-charts] - https://gerrit.wikimedia.org/r/1307380 (https://phabricator.wikimedia.org/T402408) (owner: Btullis)
[11:22:58] (PS1) Hnowlan: kafka: migrate check_kafka_ssl to alertmanager [puppet] - https://gerrit.wikimedia.org/r/1307405 (https://phabricator.wikimedia.org/T407117)
[11:25:17] (CR) Kamila Součková: "Thank you for checking, appreciated!" [puppet] - https://gerrit.wikimedia.org/r/1307205 (https://phabricator.wikimedia.org/T423714) (owner: Kamila Součková)
[11:25:22] (CR) Kamila Součková: [C:+2] deployment_server: disable mw-cgroup on bookworm [puppet] - https://gerrit.wikimedia.org/r/1307205 (https://phabricator.wikimedia.org/T423714) (owner: Kamila Součková)
[11:26:41] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085332 (AndrewTavis_WMDE) Responding to the two comments below: >>! In T430961#12084084, @MoritzMuehlenhoff wrote: > @AndrewTavis_WMDE @catherine.kelsey.wmde We can al...
[11:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:56] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12085361 (10MoritzMuehlenhoff) [11:50:50] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085369 (10Lena_WMDE) Hi all, I approve @catherine.kelsey.wmde 's request. [11:55:00] 06SRE, 10Bitu, 06Infrastructure-Foundations: Move tracking of LDAP access to a new repository and manage it from Bitu - https://phabricator.wikimedia.org/T431111 (10MoritzMuehlenhoff) 03NEW [11:57:12] 06SRE, 10Bitu, 06Infrastructure-Foundations: Move tracking of LDAP access to a new repository and manage it from Bitu - https://phabricator.wikimedia.org/T431111#12085395 (10LSobanski) [11:58:56] !log jynus@cumin2003 START - Cookbook sre.hosts.decommission for hosts backup[2004-2007].codfw.wmnet [11:59:23] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:59:25] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:01:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:01:39] 06SRE, 06Infrastructure-Foundations: Import/create samplicator source package - https://phabricator.wikimedia.org/T337208#12085400 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:02:39] RESOLVED: [4x] CoreBGPDown: 
Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:07:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:08:12] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Terms of use for Mailman - https://phabricator.wikimedia.org/T431096#12085408 (10Peachey88) Potentially crosses over to {T340375} and we should probably do at the same time. [12:09:03] !log jynus@cumin2003 START - Cookbook sre.dns.netbox [12:10:42] 06SRE, 10Hiddenparma: Migrate CSV ipblock sources to hiddenparma - https://phabricator.wikimedia.org/T417587#12085416 (10MLechvien-WMF) [12:11:12] 06SRE, 10Hiddenparma: Migrate RIPE ipblock sources to hiddenparma - https://phabricator.wikimedia.org/T417586#12085417 (10MLechvien-WMF) [12:11:16] (03PS1) 10Hnowlan: ncredir: migrate nrpe check to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1307419 (https://phabricator.wikimedia.org/T407117) [12:12:13] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 07Sustainability (Incident Followup): The webrequest_sampled_live data pipeline and its query tools have become mission-critical and require re-engineering for resilience - https://phabricator.wikimedia.org/T431112 (10BTullis) 03NEW [12:12:16] (03PS4) 10Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures [dumps] - 10https://gerrit.wikimedia.org/r/1307379 [12:12:42] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085440 (10MoritzMuehlenhoff) @catherine.kelsey.wmde I've added you to the 
cn=airflow-wmde-ops LDAP group, please log out of the Wikimedia SSO by accessing https://idp.wiki... [12:12:50] (03PS5) 10Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures [dumps] - 10https://gerrit.wikimedia.org/r/1307379 [12:12:54] (03PS1) 10Kamila Součková: site: switch deploy2003 to deployment_server role [puppet] - 10https://gerrit.wikimedia.org/r/1307421 (https://phabricator.wikimedia.org/T423714) [12:13:07] (03CR) 10Kosta Harlan: Filter change_tag and change_tag_def dumps (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [12:14:02] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307421 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:14:56] (03CR) 10Kamila Součková: "🍿" [puppet] - 10https://gerrit.wikimedia.org/r/1307421 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:15:02] !log jynus@cumin2003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup[2004-2007].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2003" [12:15:18] !log jynus@cumin2003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: backup[2004-2007].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2003" [12:15:18] !log jynus@cumin2003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:15:20] !log jynus@cumin2003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts backup[2004-2007].codfw.wmnet [12:15:31] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [12:15:43] (03CR) 10Jcrespo: [C:03+2] mediabackups: Remove insetup hosts backup1004-backup1007 [puppet] - 10https://gerrit.wikimedia.org/r/1307359 
(https://phabricator.wikimedia.org/T431097) (owner: 10Jcrespo) [12:15:54] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [12:15:56] (03CR) 10Jcrespo: [C:03+2] mediabackups: Remove insetup hosts backup2004-backup2007 [puppet] - 10https://gerrit.wikimedia.org/r/1307360 (https://phabricator.wikimedia.org/T431098) (owner: 10Jcrespo) [12:17:53] 10ops-codfw, 06DC-Ops, 10decommission-hardware: Decommission backup2004, backup2005, backup2006 & backup2007 - https://phabricator.wikimedia.org/T431098#12085455 (10jcrespo) [12:18:44] 10ops-codfw, 06DC-Ops, 10decommission-hardware: Decommission backup2004, backup2005, backup2006 & backup2007 - https://phabricator.wikimedia.org/T431098#12085462 (10jcrespo) This is ready for DC ops, please be aware that it will take at least one additional month before we can decommission backup2003 (not re... [12:19:03] !log jmm@cumin2003 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2005.wikimedia.org [12:19:26] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#12085476 (10ops-monitoring-bot) VM urldownloader2005.wikimedia.org rebooted by jmm@cumin2003 with reason: bump resources [12:20:12] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: Decommission backup1004, backup1005, backup1006 & backup1007 - https://phabricator.wikimedia.org/T431097#12085477 (10jcrespo) [12:20:16] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: Decommission backup1004, backup1005, backup1006 & backup1007 - https://phabricator.wikimedia.org/T431097#12085481 (10jcrespo) This is ready for DC ops, please be aware that it will take at least one additional month before we can decommission backup1003 (not re... 
[12:20:30] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:21:00] (03PS1) 10Majavah: signup: Do not use a valid, unrelated domain as a placeholder [software/bitu] - 10https://gerrit.wikimedia.org/r/1307422 [12:21:19] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12085482 (10MoritzMuehlenhoff) [12:22:09] (03PS10) 10Kosta Harlan: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [12:22:09] (03PS2) 10Kosta Harlan: WIP: Test fixes and unquoting [dumps] - 10https://gerrit.wikimedia.org/r/1307397 [12:23:32] !log jmm@cumin2003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2005.wikimedia.org [12:23:46] (03CR) 10CI reject: [V:04-1] Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [12:24:11] (03CR) 10Kosta Harlan: [C:04-1] Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [12:24:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1307421 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:26:23] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2003.codfw.wmnet [12:32:30] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2003.codfw.wmnet [12:35:59] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#12085542 (10MoritzMuehlenhoff) >>! In T427282#12075814, @MoritzMuehlenhoff wrote: > Squid got oom-killed on urldownloader1005. 
Traffic was stable across the day, but then traffic increase around the... [12:37:51] (03PS1) 10Muehlenhoff: Switch the URL downloaders back to the Trixie instances [dns] - 10https://gerrit.wikimedia.org/r/1307431 (https://phabricator.wikimedia.org/T427282) [12:38:31] ACKNOWLEDGEMENT - SSH on db1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Jcrespo bad state https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:38:32] ACKNOWLEDGEMENT - Host db1245 is DOWN: PING CRITICAL - Packet loss = 100% Jcrespo bad state [12:39:47] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [12:40:09] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [12:40:30] (03PS1) 10Hnowlan: Improve InterfaceSpeedError messages [alerts] - 10https://gerrit.wikimedia.org/r/1307432 (https://phabricator.wikimedia.org/T353323) [12:40:33] (03CR) 10Muehlenhoff: "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1307422 (owner: 10Majavah) [12:41:09] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [12:42:48] (03CR) 10CI reject: [V:04-1] Improve InterfaceSpeedError messages [alerts] - 10https://gerrit.wikimedia.org/r/1307432 (https://phabricator.wikimedia.org/T353323) (owner: 10Hnowlan) [12:46:18] 10ops-eqiad, 10Data-Persistence-Backup, 06DC-Ops: db1245 crashed - https://phabricator.wikimedia.org/T431115#12085562 (10jcrespo) @dcops , I tried hard-resetting the host remotely, but it doesn't boot up, it shows multiple power issues to cpu, other board locations. Please see if you can at least drain power... 
[12:46:47] (03PS1) 10Kamila Součková: hiera: add deploy2003 to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) [12:47:19] (03CR) 10Kamila Součková: [C:03+2] site: switch deploy2003 to deployment_server role [puppet] - 10https://gerrit.wikimedia.org/r/1307421 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:47:30] (03PS2) 10Hnowlan: Improve InterfaceSpeedError messages [alerts] - 10https://gerrit.wikimedia.org/r/1307432 (https://phabricator.wikimedia.org/T353323) [12:47:51] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [12:50:12] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest1005.eqiad.wmnet [12:50:13] (03PS1) 10AikoChou: changeprop: add liftwing revertrisk-wikidata to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307434 (https://phabricator.wikimedia.org/T420883) [12:50:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest1005.eqiad.wmnet [12:51:47] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:52:05] (03PS11) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [12:52:14] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [12:52:44] (03CR) 10Kamila Součková: "pcc will not be useful for deployment2003 until after it's settled, at which point I'll rerun it, but I want to see the existing ones." 
[puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:53:10] (03PS1) 10Majavah: Revert "P:dumps: rsync: Do not use LOAD_BALANCER_HEALTH_CHECKS" [puppet] - 10https://gerrit.wikimedia.org/r/1307435 [12:53:18] (03PS1) 10Jcrespo: mariadb & mediabackups: Replace db1245 during ongoing hw issues [puppet] - 10https://gerrit.wikimedia.org/r/1307436 (https://phabricator.wikimedia.org/T431115) [12:53:42] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [12:54:15] (03PS2) 10Jcrespo: mariadb & mediabackups: Replace db1245 during ongoing hw issues [puppet] - 10https://gerrit.wikimedia.org/r/1307436 (https://phabricator.wikimedia.org/T431115) [12:54:58] (03PS1) 10Clément Goubert: tlsproxy: Fix ratelimit descriptor [puppet] - 10https://gerrit.wikimedia.org/r/1307437 (https://phabricator.wikimedia.org/T414440) [12:55:49] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307436 (https://phabricator.wikimedia.org/T431115) (owner: 10Jcrespo) [12:57:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [12:57:37] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [12:57:49] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [12:58:36] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:00:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:00:43] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - 
https://phabricator.wikimedia.org/T427072#12085604 (10MoritzMuehlenhoff) [13:01:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:42] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:02:57] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=99) for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:06:22] (03CR) 10Clément Goubert: [C:03+2] tlsproxy: Fix ratelimit descriptor [puppet] - 10https://gerrit.wikimedia.org/r/1307437 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [13:06:29] (03CR) 10Jcrespo: [C:03+2] mariadb & mediabackups: Replace db1245 during ongoing hw issues [puppet] - 10https://gerrit.wikimedia.org/r/1307436 (https://phabricator.wikimedia.org/T431115) (owner: 10Jcrespo) [13:06:51] ok to merge, claime? [13:06:59] yes please [13:07:01] doing [13:07:05] (03PS1) 10AikoChou: EventStreamConfig: add page_revert_risk_wikidata_prediction_change.v1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307438 (https://phabricator.wikimedia.org/T420883) [13:07:08] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [13:07:21] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [13:07:41] (03CR) 10Ssingh: [C:03+1] "Thanks, sounds like a plan. I am guessing we will aim for a Monday deploy?" 
[dns] - 10https://gerrit.wikimedia.org/r/1307431 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [13:08:05] claime: merge completed ok [13:08:18] tyvm [13:11:21] (03CR) 10Muehlenhoff: "Yes, I'll merge this Monday morning" [dns] - 10https://gerrit.wikimedia.org/r/1307431 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [13:13:11] (03PS12) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:13:20] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:13:27] (03PS1) 10Marostegui: production-m5.sql.erb: Remove ipoid grants [puppet] - 10https://gerrit.wikimedia.org/r/1307440 (https://phabricator.wikimedia.org/T431007) [13:14:38] !log imported samplicator 1.3.8rc1-1+deb13u1 to trixie-wikimedia/main T337208 [13:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:42] T337208: Import/create samplicator source package - https://phabricator.wikimedia.org/T337208 [13:14:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:15:45] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [13:15:54] 06SRE, 06Infrastructure-Foundations: Import/create samplicator source package - https://phabricator.wikimedia.org/T337208#12085648 (10MoritzMuehlenhoff) 05Open→03Resolved I created a Debian package based on the last 1.3.8-rc release and uploaded it to trixie-wikimedia to the main component. 
[13:16:00] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [13:16:01] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest1005.eqiad.wmnet [13:16:19] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=99) for host sretest1005.eqiad.wmnet [13:17:07] (03PS1) 10AikoChou: ml-services: enable revertrisk-wikidata event stream predictions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307441 (https://phabricator.wikimedia.org/T420883) [13:17:48] (03PS13) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:17:55] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest1005.eqiad.wmnet [13:18:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest1005.eqiad.wmnet [13:18:21] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T430961#12085660 (10catherine.kelsey.wmde) Thank you so much @MoritzMuehlenhoff - it's worked :) {F91686949} [13:20:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:11] 06SRE, 10LDAP-Access-Requests: Grant Access to for  - https://phabricator.wikimedia.org/T430961#12085665 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Thanks for confirming, resolving the task. 
[13:22:34] (03PS1) 10Hashar: zuul: allow encoded slashes [puppet] - 10https://gerrit.wikimedia.org/r/1307442 (https://phabricator.wikimedia.org/T431003) [13:23:10] (03PS14) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:24:01] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest1005.eqiad.wmnet [13:24:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest1005.eqiad.wmnet [13:24:31] (03PS1) 10Ssingh: wikimedia.org: add CNAME for _dnsauth (VMC) [dns] - 10https://gerrit.wikimedia.org/r/1307443 (https://phabricator.wikimedia.org/T431062) [13:24:33] !log elukey@cumin1003 START - Cookbook sre.hosts.bmc-user-mgmt for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:26:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.bmc-user-mgmt (exit_code=0) for host sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1005-1006].eqiad.wmnet [13:26:30] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [13:26:53] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [13:29:02] (03CR) 10Elukey: "@jhathaway@wikimedia.org @mmuhlenhoff@wikimedia.org I test-cookbooked this with all the sretest* hosts, changing passwords etc.. 
and it lo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [13:29:33] (03PS2) 10Ssingh: wikimedia.org: add CNAME for _dnsauth (VMC) [dns] - 10https://gerrit.wikimedia.org/r/1307443 (https://phabricator.wikimedia.org/T431062) [13:31:00] (03PS3) 10Ssingh: wikimedia.org: add CNAME for _dnsauth (VMC) [dns] - 10https://gerrit.wikimedia.org/r/1307443 (https://phabricator.wikimedia.org/T431062) [13:32:24] (03CR) 10Fabfur: [C:03+1] wikimedia.org: add CNAME for _dnsauth (VMC) [dns] - 10https://gerrit.wikimedia.org/r/1307443 (https://phabricator.wikimedia.org/T431062) (owner: 10Ssingh) [13:33:20] (03CR) 10Ssingh: [C:03+2] wikimedia.org: add CNAME for _dnsauth (VMC) [dns] - 10https://gerrit.wikimedia.org/r/1307443 (https://phabricator.wikimedia.org/T431062) (owner: 10Ssingh) [13:35:58] !log sukhe@dns1004 START - running authdns-update [13:38:14] !log sukhe@dns1004 END - running authdns-update [13:43:08] (03PS1) 10Hashar: zuul: move vim: modeline to top of apache conf [puppet] - 10https://gerrit.wikimedia.org/r/1307447 [13:51:29] (03PS2) 10Kamila Součková: hiera: add deploy2003 to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) [13:51:35] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [13:58:39] (03CR) 10Kamila Součková: [C:04-1] "Is the change in port (4456 -> 4444) intentional? 
According to https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports , 4444 is mw-d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307358 (owner: 10Blake) [14:00:26] (03CR) 10Kamila Součková: [C:03+2] hiera: add deploy2003 to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/1307433 (https://phabricator.wikimedia.org/T423714) (owner: 10Kamila Součková) [14:05:57] (03CR) 10Kamila Součková: [C:03+1] "My bad, this is fine if you're not setting up ingress in the same step." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307358 (owner: 10Blake) [14:08:17] (03CR) 10Blake: [C:03+2] mw-pretrain: remove nodePort service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1307358 (owner: 10Blake) [14:10:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:10:33] (03Merged) 10jenkins-bot: mw-pretrain: remove nodePort service. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1307358 (owner: 10Blake) [14:15:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:16:02] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:19:43] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link IP dns for trasnport circuits to ulsfo - cmooney@cumin1003" [14:19:47] (03PS1) 10Cathal Mooney: Add INCLUDE statements for PTR ranges for new ulsfo transport ranges [dns] - 10https://gerrit.wikimedia.org/r/1307455 (https://phabricator.wikimedia.org/T424839) [14:20:03] 06SRE, 06Data-Engineering, 06Data-Platform-SRE (2026-07-03 - 2026-07-31), 07Sustainability (Incident Followup): The webrequest_sampled_live data pipeline and its query tools have become mission-critical and require re-engineering for resilience - https://phabricator.wikimedia.org/T431112#12085861 (10BTullis... 
[14:22:48] cmooney@cumin1003 netbox (PID 1399738) is awaiting input [14:23:20] (03CR) 10Ssingh: [C:03+1] Add INCLUDE statements for PTR ranges for new ulsfo transport ranges [dns] - 10https://gerrit.wikimedia.org/r/1307455 (https://phabricator.wikimedia.org/T424839) (owner: 10Cathal Mooney) [14:24:28] (03PS1) 10Ssingh: admin: remove non-Yubikey SSH key for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/1307457 [14:25:54] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDE statements for PTR ranges for new ulsfo transport ranges [dns] - 10https://gerrit.wikimedia.org/r/1307455 (https://phabricator.wikimedia.org/T424839) (owner: 10Cathal Mooney) [14:26:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link IP dns for trasnport circuits to ulsfo - cmooney@cumin1003" [14:26:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:26:18] !log cmooney@dns3003 START - running authdns-update [14:30:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1307457 (owner: 10Ssingh) [14:31:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:36:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:40:15] !log cmooney@dns3003 END - running authdns-update [14:44:20] (03PS1) 10Hashar: zuul: add 
some cache-control to web responses [puppet] - 10https://gerrit.wikimedia.org/r/1307459 (https://phabricator.wikimedia.org/T430462) [14:54:22] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#12085976 (10MoritzMuehlenhoff) [14:57:31] 10SRE-swift-storage, 06Infrastructure-Foundations, 06MediaWiki-Platform-Team, 10MediaWiki-Uploading, 07Wikimedia-production-error: MediaWiki\Upload\Exception\UploadChunkFileException: Error storing file in '{chunkPath}': backend-fail-internal; local-swif... - https://phabricator.wikimedia.org/T430986#12085998 [14:58:26] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12086001 (10cmooney) Looking at T429338 one thing that occurs to me is does it make sense to Anycast to the LVS? What I mean is use... [15:02:35] (03CR) 10SD0001: tables-catalog: set betafeatures_user_counts to public visibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1298329 (https://phabricator.wikimedia.org/T402145) (owner: 10SD0001) [15:08:10] (03CR) 10Ssingh: [C:03+2] admin: remove non-Yubikey SSH key for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/1307457 (owner: 10Ssingh) [15:16:02] 06SRE, 06Data-Engineering, 06Data-Platform-SRE (2026-07-03 - 2026-07-31), 07Sustainability (Incident Followup): The webrequest_sampled_live data pipeline and its query tools have become mission-critical and require re-engineering for resilience - https://phabricator.wikimedia.org/T431112#12086035 (10elukey... 
[15:26:42] (03Abandoned) 10Dreamy Jazz: MariaDB grants: Drop ipoid database users [puppet] - 10https://gerrit.wikimedia.org/r/1307208 (https://phabricator.wikimedia.org/T431007) (owner: 10Dreamy Jazz) [15:27:08] (03CR) 10Dreamy Jazz: "(I've abandoned Ifb28379056a689514de5a32cd67d22286397d24e in favour of this as these commits are the same)" [puppet] - 10https://gerrit.wikimedia.org/r/1307440 (https://phabricator.wikimedia.org/T431007) (owner: 10Marostegui) [15:27:15] (03CR) 10Dreamy Jazz: [C:03+1] production-m5.sql.erb: Remove ipoid grants [puppet] - 10https://gerrit.wikimedia.org/r/1307440 (https://phabricator.wikimedia.org/T431007) (owner: 10Marostegui) [15:34:38] (03PS1) 10Ssingh: Revert "varnish: Add CSP report-only header value" [puppet] - 10https://gerrit.wikimedia.org/r/1307474 [15:34:44] (03PS2) 10Atsuko: cirrussearch: adding max_clause_count parameter [puppet] - 10https://gerrit.wikimedia.org/r/1307470 (https://phabricator.wikimedia.org/T431086) [15:35:50] !log atsuko@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search,name=codfw [15:35:58] !log atsuko@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-psi,name=codfw [15:36:04] !log atsuko@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=search-omega,name=codfw [15:36:08] (03CR) 10Ssingh: "I should have repeated it so it's my bad but as mentioned in the comment above, we should not have merged this until Scott gave his +1. @s" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [15:37:08] (03CR) 10Ssingh: "In the future, I will make sure that an explicit -1 has been provided so we don't make the same mistake." [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [15:38:00] (03CR) 10DCausse: [C:03+1] "lgtm, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1307470 (https://phabricator.wikimedia.org/T431086) (owner: 10Atsuko) [15:44:06] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on asw1-[22-23]-ulsfo,cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt with reason: upgrade JunOS cr3-ulsfo [15:45:01] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lvs[4008-4010].ulsfo.wmnet with reason: upgrade JunOS cr3-ulsfo [15:47:58] 06SRE, 10hCaptcha, 06Product Safety and Integrity, 10ServiceOps-Mediawiki, 06Traffic: memcached errors seen in hCaptcha health checks - https://phabricator.wikimedia.org/T430340#12086110 (10ssingh) [15:48:11] 06SRE, 10hCaptcha, 06Product Safety and Integrity, 10ServiceOps-Mediawiki, 06Traffic: memcached errors seen in hCaptcha health checks - https://phabricator.wikimedia.org/T430340#12086123 (10ssingh) I am taking the liberty to ServiceOps since they own memcached. [15:48:30] (03CR) 10Ssingh: "Aiming for a Monday July 6 release." 
[puppet] - 10https://gerrit.wikimedia.org/r/1305635 (https://phabricator.wikimedia.org/T426379) (owner: 10Fabfur) [15:52:13] !log adjust outbound BGP policies on cr3-ulsfo to drain router of traffic T424839 [15:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:18] (03CR) 10Ssingh: trafficserver: raise ATS timeouts for the gerrit secondary backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1304808 (https://phabricator.wikimedia.org/T429749) (owner: 10Arnaudb) [16:09:43] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:11:13] (03PS11) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:11:15] (03CR) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [16:11:49] (03PS12) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:12:22] (03CR) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [16:13:18] (03CR) 10CI reject: [V:04-1] Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [16:13:52] (03PS3) 10Daniel Kinzler: smokepy: use live mount for test files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302104 (https://phabricator.wikimedia.org/T424825) [16:14:13] (03CR) 10CI reject: [V:04-1] 
smokepy: use live mount for test files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302104 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [16:14:43] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:16] (03PS1) 10Kamila Součková: Revert "hiera: add deploy2003 to deployment servers" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 [16:15:56] (03CR) 10CI reject: [V:04-1] Revert "hiera: add deploy2003 to deployment servers" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 (owner: 10Kamila Součková) [16:16:27] (03PS2) 10Kamila Součková: Revert "hiera: add deploy2003 to deployment servers" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 [16:16:34] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 (owner: 10Kamila Součková) [16:22:50] (03PS13) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:24:24] (03CR) 10CI reject: [V:04-1] Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [16:24:43] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:26:36] (03PS4) 10Daniel Kinzler: smokepy: use live mount for test files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302104 
(https://phabricator.wikimedia.org/T424825) [16:26:54] (03CR) 10CI reject: [V:04-1] smokepy: use live mount for test files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302104 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [16:28:00] (03PS14) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:28:10] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "P:dumps: rsync: Do not use LOAD_BALANCER_HEALTH_CHECKS" [puppet] - 10https://gerrit.wikimedia.org/r/1307435 (owner: 10Majavah) [16:29:12] (03CR) 10Daniel Kinzler: smokepy: use live mount for test files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302104 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [16:29:32] (03CR) 10CI reject: [V:04-1] Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [16:30:31] (03PS15) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:30:51] (03CR) 10Dreamy Jazz: "Thanks merged these changes into my commit" [dumps] - 10https://gerrit.wikimedia.org/r/1307397 (owner: 10Kosta Harlan) [16:31:39] (03PS16) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:33:05] (03CR) 10Filippo Giunchedi: [C:03+1] hieradata: Remove unused openstack endpoint keys [puppet] - 10https://gerrit.wikimedia.org/r/1307363 (owner: 10Majavah) [16:35:10] (03CR) 10Dreamy Jazz: Enable unit tests in CI and fix pre-existing test failures (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1307379 (owner: 10Kosta Harlan) [16:35:44] (03PS17) 10Dreamy Jazz: Filter change_tag and change_tag_def dumps [dumps] - 
10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) [16:36:11] (03PS3) 10Kamila Součková: Revert "hiera: add deploy2003 to deployment servers" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 [16:36:14] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 (owner: 10Kamila Součková) [16:40:12] 06SRE, 10MediaWiki-extensions-ModeratorToolkit, 06Moderator-Tools-Team: Production Readiness Checklist for the ModeratorToolkit Extension - https://phabricator.wikimedia.org/T431133 (10DMburugu) 03NEW [16:41:28] 06SRE, 10MediaWiki-extensions-ModeratorToolkit, 06Moderator-Tools-Team: Production Readiness Checklist for the ModeratorToolkit Extension - https://phabricator.wikimedia.org/T431133#12086236 (10DMburugu) [16:45:47] (03CR) 10Kamila Součková: [C:03+2] Revert "hiera: add deploy2003 to deployment servers" [puppet] - 10https://gerrit.wikimedia.org/r/1307480 (owner: 10Kamila Součková) [16:48:53] !log reboot cr3-ulsfo to upgrade JunOS and reset linecard T424839 [16:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:59] (03PS19) 10FNegri: sre.mysql.multiinstance_reboot: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) [16:50:10] (03PS1) 10SD0001: maintain-views: add view for global_edit_count [puppet] - 10https://gerrit.wikimedia.org/r/1307487 (https://phabricator.wikimedia.org/T344108) [16:51:12] (03CR) 10FNegri: sre.mysql.multiinstance_reboot: new cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [16:52:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:52:31] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:52:49] ^^ this is me, I somehow forgot the downtime for the other router [16:53:27] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr4-ulsfo with reason: upgrade JunOS cr3-ulsfo [16:53:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:53:48] ugh [16:53:51] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-eqord with reason: upgrade JunOS cr3-ulsfo [16:57:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:00:31] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:08:02] !log revert protocol preference changes on cr3-ulsfo after upgrade [17:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:14:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:17:15] RESOLVED: [2x] 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:20:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:44] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12086301 (10cmooney) FWIW I upgraded cr3-ulsfo as I had to drain it to reset the PIC and it was on known-bad release 23.4R2-S7. [17:29:45] (03Abandoned) 10Kosta Harlan: WIP: Test fixes and unquoting [dumps] - 10https://gerrit.wikimedia.org/r/1307397 (owner: 10Kosta Harlan) [17:47:15] (03CR) 10Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1307379 (owner: 10Kosta Harlan) [17:55:38] (03PS6) 10Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures [dumps] - 10https://gerrit.wikimedia.org/r/1307379 [17:55:38] (03PS18) 10Kosta Harlan: Filter change_tag and change_tag_def dumps [dumps] - 10https://gerrit.wikimedia.org/r/1307262 (https://phabricator.wikimedia.org/T386456) (owner: 10Dreamy Jazz) [17:56:08] (03CR) 10Kosta Harlan: Enable unit tests in CI and fix pre-existing test failures (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1307379 (owner: 10Kosta Harlan) [18:46:18] 06SRE, 06Data-Engineering, 06Data-Platform-SRE (2026-07-03 - 2026-07-31), 07Sustainability (Incident Followup): The webrequest_sampled_live data pipeline and its query tools have become mission-critical and require re-engineering for resilience - https://phabricator.wikimedia.org/T431112#12086387 (10LSobans... 
[19:00:22] (03CR) 10Ladsgroup: "most of these links are intentionally shortened" [puppet] - 10https://gerrit.wikimedia.org/r/1306094 (owner: 10Simon04) [19:10:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:11:16] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#12086445 (10Ladsgroup) [19:11:36] 06SRE, 10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#12086447 (10Ladsgroup) [19:25:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:54:41] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#12086474 (10Nemoralis) [20:01:10] (03CR) 10Simon04: "Hi, I started this patch because I figured that "Use thumbnail sizes listed on https://w.wiki/GHai" could give some more insights. 
Immedia" [puppet] - 10https://gerrit.wikimedia.org/r/1306094 (owner: 10Simon04) [20:27:21] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [20:28:11] RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Docker [20:38:06] (03CR) 10Dreamy Jazz: [C:03+1] "Seems fine to me" [dumps] - 10https://gerrit.wikimedia.org/r/1307379 (owner: 10Kosta Harlan) [20:56:46] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277#12086714 (10Tvassilian) 05Open→03In progress [21:20:40] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:36] (03PS1) 10Jforrester: logging: Switch the wmfconfig processor to Monolog 3's type (LogRecord) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1307506 (https://phabricator.wikimedia.org/T397070) [23:42:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1307524 [23:42:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1307524 (owner: 10TrainBranchBot) [23:50:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1307524 (owner: 10TrainBranchBot)