[00:04:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1191865
[00:08:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1191865 (owner: TrainBranchBot)
[00:29:33] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1191865 (owner: TrainBranchBot)
[01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:14:30] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 50s)
[01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[02:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:32] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:09:10] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:24] (CR) Finchgold: [C:+1] Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1191804 (https://phabricator.wikimedia.org/T405830) (owner: Lucas Werkmeister)
[05:06:13] (CR) KartikMistry: [C:+2] Update cxserver to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191364 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[05:08:10] (Merged) jenkins-bot: Update cxserver to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191364 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:35] (PS1) KartikMistry: Revert "Update cxserver to 2025-09-25-074241-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1191872
[05:20:36] (CR) KartikMistry: [C:+2] Revert "Update cxserver to 2025-09-25-074241-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1191872 (owner: KartikMistry)
[05:22:34] (Merged) jenkins-bot: Revert "Update cxserver to 2025-09-25-074241-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1191872 (owner: KartikMistry)
[05:34:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:56:58] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[05:58:50] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[06:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:28:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet
[06:33:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet
[06:37:48] !log upgrade Envoy on chartmuseum hosts T403663
[06:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[06:38:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet
[06:38:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet
[06:38:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[06:41:11] dse-k8s-etcd1003 and ml-etcd1002 will go down for a Ganeti reboot
[06:41:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet
[06:43:14] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:43:32] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:32] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[06:45:42] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[06:46:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet
[06:46:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet
[06:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T0700). Please do the needful.
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:07:48] (CR) Jelto: [V:+1 C:+2] ceph: add module to sync a bucket locally [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:13:19] (CR) Fabfur: [C:+1] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: CDanis)
[07:17:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet
[07:19:40] (CR) Majavah: [C:+2] P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[07:20:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet
[07:21:12] (PS3) Majavah: haproxy::cloud: Add an admin-level socket [puppet] - https://gerrit.wikimedia.org/r/1191662
[07:21:12] (PS6) Majavah: haproxy::cloud: Do not duplicate main haproxy class [puppet] - https://gerrit.wikimedia.org/r/1191664
[07:21:46] (CR) Elukey: [C:+1] imposm-initial-import: Set service passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:23:33] (CR) Majavah: [C:+2] haproxy::cloud: Add an admin-level socket [puppet] - https://gerrit.wikimedia.org/r/1191662 (owner: Majavah)
[07:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet
[07:25:42] (CR) Majavah: [C:+2] haproxy::cloud: Do not duplicate main haproxy class [puppet] - https://gerrit.wikimedia.org/r/1191664 (owner: Majavah)
[07:25:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet
[07:25:58] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[07:26:50] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Docker
[07:27:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet
[07:32:19] jmm@cumin2002 drain-node (PID 3019950) is awaiting input
[07:32:29] SRE, collaboration-services, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11223257 (MoritzMuehlenhoff)
[07:32:32] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:09] (CR) Elukey: osm_master: Store kartotherian and tegola passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:34:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet
[07:35:21] (CR) Muehlenhoff: osm_master: Store kartotherian and tegola passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:37:11] !log upgrade Envoy on config-master* T403663
[07:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:18] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[07:39:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet
[07:39:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet
[07:41:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet
[07:44:39] (PS1) Elukey: CHANGELOG: add changelogs for release v11.9.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1191983
[07:44:44] (PS1) Jelto: ceph::client::sync_local: fix ensure for directory [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922)
[07:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:45:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet
[07:46:22] Puppet, MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11223314 (Krinkle)
[07:46:57] Puppet, MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11223316 (Krinkle)
[07:47:01] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:47:14] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11223317 (Krinkle)
[07:51:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet
[07:51:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet
[07:52:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet
[07:52:13] (CR) Elukey: [C:+1] osm_master: Store kartotherian and tegola passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:52:33] (CR) Hashar: [C:-1] phabricator: hiera'ize the apc_shm_size variable (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[07:55:06] ml-etcd1003 will go down for a Ganeti reboot
[07:55:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet
[07:55:27] (CR) Muehlenhoff: [C:+2] osm_master: Store kartotherian and tegola passwords [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:55:34] (CR) Elukey: [C:+2] CHANGELOG: add changelogs for release v11.9.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1191983 (owner: Elukey)
[07:56:34] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:58:26] (CR) Jelto: [V:+1 C:+2] ceph::client::sync_local: fix ensure for directory [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:00:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet
[08:00:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet
[08:00:42] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:01:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1053.eqiad.wmnet
[08:02:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet
[08:02:25] (PS1) Elukey: Upstream release v11.9.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1192050
[08:05:55] (CR) Elukey: [V:+2 C:+2] Upstream release v11.9.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1192050 (owner: Elukey)
[08:07:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet
[08:07:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1053.eqiad.wmnet
[08:08:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1054.eqiad.wmnet
[08:09:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:34] (CR) David Caro: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:11:55] jmm@cumin2002 drain-node (PID 3041113) is awaiting input
[08:13:33] (PS1) Muehlenhoff: Don't set profile::maps::osm_master::tilerator_pass in role default [puppet] - https://gerrit.wikimedia.org/r/1192054 (https://phabricator.wikimedia.org/T381565)
[08:14:41] (PS1) Slyngshede: Update CAS to version 7.1.6.2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1192055
[08:15:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet
[08:16:28] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11223446 (Jelto) Sync from object storage to a local folder works with the new `ceph::client::sync_local` module. I tested this o...
[08:17:21] (CR) Muehlenhoff: [C:+1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1192055 (owner: Slyngshede)
[08:17:35] (PS1) Elukey: sre.hosts.provision: update to Redfish's hw_model [cookbooks] - https://gerrit.wikimedia.org/r/1192056
[08:18:41] (CR) Slyngshede: [V:+2 C:+2] Update CAS to version 7.1.6.2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1192055 (owner: Slyngshede)
[08:20:30] SRE-tools, Infrastructure-Foundations: CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706#11223453 (elukey) https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1192056
[08:20:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet
[08:20:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1054.eqiad.wmnet
[08:20:52] (CR) Arnaudb: [C:+1] "thanks for the quickfix! lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1192056 (owner: Elukey)
[08:20:54] !log uploaded spicerack_11.9.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia
[08:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[08:26:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[08:32:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[08:32:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[08:33:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:35:16] !log rolled out spicerack 11.9.0 to all cumin nodes
[08:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:22] (PS1) Slyngshede: IDP: Upgrade to CAS 7.1.6.2 [dns] - https://gerrit.wikimedia.org/r/1192058
[08:35:29] (CR) Elukey: [C:+2] sre.hosts.provision: update to Redfish's hw_model [cookbooks] - https://gerrit.wikimedia.org/r/1192056 (owner: Elukey)
[08:35:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
[08:36:16] SRE-tools, Infrastructure-Foundations, serviceops, Spicerack: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11223585 (elukey) Spicerack 11.9.0 deployed on all cumin nodes :)
[08:37:34] (CR) Elukey: [C:+1] Don't set profile::maps::osm_master::tilerator_pass in role default [puppet] - https://gerrit.wikimedia.org/r/1192054 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:38:21] (CR) Slyngshede: [C:+2] IDP: Upgrade to CAS 7.1.6.2 [dns] - https://gerrit.wikimedia.org/r/1192058 (owner: Slyngshede)
[08:38:27] !log slyngshede@dns1004 START - running authdns-update
[08:38:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:39:28] (CR) Muehlenhoff: [C:+2] Don't set profile::maps::osm_master::tilerator_pass in role default [puppet] - https://gerrit.wikimedia.org/r/1192054 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:39:50] !log slyngshede@dns1004 END - running authdns-update
[08:43:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
[08:44:53] SRE-tools, Infrastructure-Foundations: CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706#11223632 (ABran-WMF) Open→Resolved a:elukey [[ https://integration.wikimedia.org/ci/job/tox/7677/console | CI went through ]], thanks for the fix!
[08:45:23] (PS7) D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808)
[08:45:49] (PS10) Arnaudb: gerrit: bugfixes on failover [cookbooks] - https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833)
[08:45:49] (CR) Arnaudb: "full dry-run output is visible here: https://phabricator.wikimedia.org/P83469" [cookbooks] - https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:46:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01)
[08:49:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
[08:49:31] (PS1) Slyngshede: Revert "IDP: Upgrade to CAS 7.1.6.2" [dns] - https://gerrit.wikimedia.org/r/1192059
[08:49:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[08:50:41] (PS1) Jelto: gitlab: enable bucket sync on production host [puppet] - https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922)
[08:51:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[08:52:02] !log powercycling db1150 T405885
[08:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:08] T405885: db1150 crash - https://phabricator.wikimedia.org/T405885
[08:52:19] (CR) Slyngshede: [C:+2] Revert "IDP: Upgrade to CAS 7.1.6.2" [dns] - https://gerrit.wikimedia.org/r/1192059 (owner: Slyngshede)
[08:52:32] !log slyngshede@dns1004 START - running authdns-update
[08:53:17] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:53:57] !log slyngshede@dns1004 END - running authdns-update
[08:54:39] jmm@cumin2002 drain-node (PID 3061885) is awaiting input
[08:54:52] RECOVERY - Host db1150 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[08:54:56] PROBLEM - MariaDB Replica SQL: s3 on db1150 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:04] PROBLEM - MariaDB Replica SQL: s4 on db1150 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:04] PROBLEM - MariaDB Replica IO: s4 on db1150 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:26] PROBLEM - MariaDB Replica IO: s3 on db1150 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:56:14] PROBLEM - MariaDB read only s3 on db1150 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:56:29] (PS2) Muehlenhoff: imposm-initial-import: Set service passwords [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565)
[08:56:45] (CR) Muehlenhoff: imposm-initial-import: Set service passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:57:04] PROBLEM - mysqld processes on db1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:57:04] kubestagemaster2005 will go down for a Ganeti reboot
[08:57:14] PROBLEM - MariaDB read only s4 on db1150 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:58:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
[09:00:14] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:00:47] (PS1) Slyngshede: IDP: CAS 7.1.6.2 upgrade [dns] - https://gerrit.wikimedia.org/r/1192062
[09:03:03] SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11223739 (elukey) To recap, it seems that we have two problems: 1) For some mysterious reasons, sretest2010 seems to have stopped...
[09:03:35] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11223742 (elukey) >>! In T404356#11217341, @jhathaway wrote: >>>! In T404356#11184299, @elukey wrote: >> The host doesn't PXE/HTTP boot for some reason, I reopened the provision...
[09:04:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
[09:04:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[09:04:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:05:24] RECOVERY - Host kubestagemaster2005 is UP: PING WARNING - Packet loss = 80%, RTA = 30.49 ms
[09:05:53] (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:06:42] (CR) Sergio Gimeno: "This is now ready for review" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: Sergio Gimeno)
[09:07:17] SRE, Product Safety and Integrity, MW-1.45-notes (1.45.0-wmf.21; 2025-09-30), Patch-For-Review, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11223756 (OKryva-WMF)
[09:08:35] (PS3) Sergio Gimeno: Growth: enable new notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085)
[09:09:26] (CR) Fabfur: [C:+1] IDP: CAS 7.1.6.2 upgrade [dns] - https://gerrit.wikimedia.org/r/1192062 (owner: Slyngshede)
[09:09:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:10:27] (CR) Slyngshede: [C:+2] IDP: CAS 7.1.6.2 upgrade [dns] - https://gerrit.wikimedia.org/r/1192062 (owner: Slyngshede)
[09:10:32] !log slyngshede@dns1004 START - running authdns-update
[09:11:04] !log Upgrading IDP/CAS-SSO to version 7.1.6.2
[09:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:08] SRE, MW-1.45-notes (1.45.0-wmf.21; 2025-09-30), Patch-For-Review, Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to ... - https://phabricator.wikimedia.org/T404204#11223798
[09:11:57] !log slyngshede@dns1004 END - running authdns-update
[09:13:12] (CR) Fabfur: [C:+2] varnish: remove Host header normalization [puppet] - https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: Fabfur)
[09:23:10] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[09:24:00] (PS1) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150 [puppet] - https://gerrit.wikimedia.org/r/1192069 (https://phabricator.wikimedia.org/T405885)
[09:24:03] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[09:24:36] !log restarting blazegraph on wdqs2007, wdqs2021 and wdqs2011 (high thread count)
[09:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:07] (PS1) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:25:39] (CR) CI reject: [V:-1] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:26:46] (PS2) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:27:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:28:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:31:20] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[09:31:42] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[09:32:17] RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:32:20] (PS3) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:32:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:34:36] (CR) David Caro: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:34:56] (CR) Filippo Giunchedi: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:37:09] (PS4) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:37:24] (PS5) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:37:26] (CR) David Caro: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:37:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[09:37:54] (CR) Muehlenhoff: [C:+2] imposm-initial-import: Set service passwords [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:37:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:38:45] (CR) David Caro: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:41:23] !log depooling wdqs2007, wdqs2021 and wdqs2011 (update lag)
[09:41:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:54] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: MariaDB package update
[09:42:33] ml-staging-etcd2001 will go down for a ganeti reboot
[09:42:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[09:43:03] (PS6) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:43:32] (CR) CI reject: [V:-1] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:43:48] (PS7) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:44:18] (CR) CI reject: [V:-1] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:44:20] PROBLEM - Host ml-staging-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:45:11] (PS2) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150 [puppet] - https://gerrit.wikimedia.org/r/1192069 (https://phabricator.wikimedia.org/T405885)
[09:45:12] (PS8) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:48:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[09:48:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[09:49:10] (CR) Arnaudb: "some answers, and a question inline."
[puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [09:49:55] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192070 (owner: 10David Caro) [09:50:20] (03PS2) 10Muehlenhoff: Add maps1012 to maps1014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) [09:50:32] RECOVERY - Host ml-staging-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [09:50:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [09:52:25] (03PS9) 10David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - 10https://gerrit.wikimedia.org/r/1192070 [09:52:59] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150 [puppet] - 10https://gerrit.wikimedia.org/r/1192069 (https://phabricator.wikimedia.org/T405885) (owner: 10Jcrespo) [09:53:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [09:55:35] (03CR) 10Elukey: [C:03+1] Add maps1012 to maps1014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:58:39] (03PS10) 10David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - 10https://gerrit.wikimedia.org/r/1192070 [09:59:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [10:00:04] (03PS1) 10Elukey: aptrepo: add kubernetes 1.23 support to Trixie Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1000) [10:00:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti2029.codfw.wmnet [10:00:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [10:01:53] (03CR) 10Arnaudb: [C:03+1] "pcc looks good, lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:03:19] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1245.eqiad.wmnet [10:03:19] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1245.eqiad.wmnet [10:04:51] jmm@cumin2002 drain-node (PID 3097273) is awaiting input [10:08:39] ACKNOWLEDGEMENT - snapshot of s3 in eqiad on backupmon1001 is CRITICAL: snapshot for s3 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2025-09-25 05:35:19 Jcrespo T405885 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:08:39] ACKNOWLEDGEMENT - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2025-09-25 02:18:53 Jcrespo T405885 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:13:55] (03PS3) 10Clément Goubert: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) [10:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:16:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [10:19:13] (03CR) 10David Caro: [C:03+2] tools: add more reliable stats on nfs stuck workers [puppet] - 10https://gerrit.wikimedia.org/r/1192070 (owner: 10David Caro) [10:19:31] (03PS6) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [10:19:43] (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [10:22:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [10:22:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [10:23:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [10:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:25:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [10:25:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11224014 (10jcrespo) Sorry I didn't provide details last week, but it was quite late in my timezone. You already saw the issue, which I was late to detect because everything else w... [10:27:01] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbprov1007.eqiad.wmnet with reason: needs reimage [10:27:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11224033 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c6c8bf64-47b3-4e1e-b33d-0785ef15336a) set by jynus@cumin1003 for 7 days, 0:00:00 on 1 host(s) and their... 
[10:27:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:29:04] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [10:31:25] (03CR) 10Effie Mouzeli: "Looks OK (as on a regex can look).It would be great if in the future we add a couple of comments in the yaml file to explain what those re" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [10:31:28] (03CR) 10Effie Mouzeli: [C:03+1] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [10:31:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [10:31:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [10:31:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2048.codfw.wmnet [10:32:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:33:32] (03CR) 10JMeybohm: [C:03+2] haproxy ipblocks-all: Filter disabled ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/1190274 
(https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [10:33:45] (03CR) 10Clément Goubert: "Thanks for the review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [10:34:07] !log Created `global_block_whitelist` on thwikimedia - T400001 [10:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:13] T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001 [10:36:05] (03CR) 10Effie Mouzeli: [C:03+1] "Cheers, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [10:36:21] jmm@cumin2002 drain-node (PID 3112619) is awaiting input [10:37:16] (03PS1) 10Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) [10:37:20] dse-k8s-etcd2002 is going down for a Ganeti reboot [10:37:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet [10:37:37] (03CR) 10JMeybohm: [C:03+1] aptrepo: add kubernetes 1.23 support to Trixie Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [10:39:18] PROBLEM - Host dse-k8s-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:40:50] RECOVERY - Host dse-k8s-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.68 ms [10:42:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet [10:42:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2048.codfw.wmnet [10:43:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2049.codfw.wmnet [10:44:31] (03PS10) 10Brouberol: airflow: automatically figure out some values to reduce 
release config size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) [10:45:52] (03CR) 10Brouberol: [C:03+1] Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: 10Btullis) [10:46:25] aux-k8s-etcd2004 is going down for a Ganeti reboot [10:46:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet [10:48:22] PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:57] (03CR) 10Btullis: airflow: automatically figure out some values to reduce release config size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol) [10:50:30] RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.78 ms [10:51:06] (03CR) 10Brouberol: airflow: automatically figure out some values to reduce release config size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol) [10:51:24] (03CR) 10Elukey: [C:03+2] aptrepo: add kubernetes 1.23 support to Trixie Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [10:52:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet [10:52:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2049.codfw.wmnet [10:52:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2050.codfw.wmnet [10:52:36] (03PS2) 10Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) [10:52:43] 
(03CR) 10Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: 10Btullis) [10:53:07] !log upgrade Envoy on an-web1001 T403663 [10:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:12] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [10:54:46] (03CR) 10CI reject: [V:04-1] Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: 10Btullis) [10:55:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet [10:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:09] !log pooling wdqs2021 and wdqs2011 (caught up on lag) [10:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:11] (03CR) 10Jon Harald Søby: [C:04-1] "Like I mentioned in the other patch, these namespace aliases need to be added to that file. 
But you can re-use this patch to change the co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [11:00:38] (03PS3) 10Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) [11:00:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet [11:00:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2050.codfw.wmnet [11:01:53] (03PS1) 10Kosta Harlan: UIC: Disable external permission check for Active wikis section [extensions/CheckUser] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192097 (https://phabricator.wikimedia.org/T405889) [11:03:12] jouncebot: nowandnext [11:03:12] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [11:03:12] In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1300) [11:03:31] (03PS4) 10Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) [11:07:28] (03PS5) 10Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) [11:07:45] (03PS1) 10Kosta Harlan: SI: Fix sorting by status [extensions/CheckUser] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192098 (https://phabricator.wikimedia.org/T405605) [11:09:55] (03CR) 10Btullis: [C:03+2] Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: 10Btullis) [11:10:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using 
scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192098 (https://phabricator.wikimedia.org/T405605) (owner: 10Kosta Harlan) [11:10:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192097 (https://phabricator.wikimedia.org/T405889) (owner: 10Kosta Harlan) [11:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:59] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1235.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:14:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1235.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:14:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [11:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:22:24] (03Merged) 10jenkins-bot: SI: Fix sorting by status [extensions/CheckUser] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192098 (https://phabricator.wikimedia.org/T405605) (owner: 10Kosta Harlan) [11:22:26] (03Merged) 10jenkins-bot: UIC: Disable external permission 
check for Active wikis section [extensions/CheckUser] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192097 (https://phabricator.wikimedia.org/T405889) (owner: 10Kosta Harlan) [11:22:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:23:10] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192098|SI: Fix sorting by status (T405605)]], [[gerrit:1192097|UIC: Disable external permission check for Active wikis section (T405889)]] [11:23:18] T405605: Suggested investigations: Sorting by status doesn't always work - https://phabricator.wikimedia.org/T405605 [11:23:19] T405889: Disable external permissions check in UserInfoCard - https://phabricator.wikimedia.org/T405889 [11:23:32] !log pooling wdqs2007 (caught up on lag) [11:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:30:34] btullis@cumin1003 reimage (PID 484801) is awaiting input [11:30:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 13Patch-For-Review: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11224233 (10BTullis) a:05BTullis→03None Should be good to go. Thanks. 
[11:31:11] (03CR) 10JMeybohm: [C:03+1] Update eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [11:32:02] (03CR) 10JMeybohm: [C:03+1] admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [11:36:30] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11224277 (10BTullis) That worked well, it seems. ` btullis@cumin1003:~$ sudo cumin 'an-worker[1209-1232].eqiad.wmnet' 'perccli64 /c0 add vd each r0 wb ra' 24 hosts will be targeted:... [11:38:10] (03CR) 10JMeybohm: [C:04-1] Update eqiad to kubernetes 1.31, calico 3.29 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [11:40:09] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [11:41:56] (03Merged) 10jenkins-bot: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [11:42:03] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable bucket sync on production host [puppet] - 10https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [11:42:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11224309 (10BTullis) 05Open→03Resolved I think we can resolve this now. I have created T405903 to track adding these hosts to the cluster. 
[11:42:48] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:42:58] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:47:29] (03PS2) 10Jelto: gitlab: enable object storage for packages [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) [11:49:52] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192098|SI: Fix sorting by status (T405605)]], [[gerrit:1192097|UIC: Disable external permission check for Active wikis section (T405889)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:50:01] T405605: Suggested investigations: Sorting by status doesn't always work - https://phabricator.wikimedia.org/T405605 [11:50:02] T405889: Disable external permissions check in UserInfoCard - https://phabricator.wikimedia.org/T405889 [11:52:50] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7083/co" [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [11:54:16] kostajh: Suggested investigation fix works as expected [11:54:24] Dreamy_Jazz: thanks [11:54:49] !log kharlan@deploy2002 kharlan: Continuing with sync [12:01:54] (03PS1) 10Giuseppe Lavagetto: requestctl_rules_file: fix path for non-cache hit scopes [puppet] - 10https://gerrit.wikimedia.org/r/1192105 [12:03:06] (03CR) 10JMeybohm: Update eqiad to k8s 1.31 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [12:03:15] FIRING: 
MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.963s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:16] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl_rules_file: fix path for non-cache hit scopes [puppet] - 10https://gerrit.wikimedia.org/r/1192105 (owner: 10Giuseppe Lavagetto) [12:06:41] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage for packages [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:07:19] (03PS5) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) [12:07:21] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192098|SI: Fix sorting by status (T405605)]], [[gerrit:1192097|UIC: Disable external permission check for Active wikis section (T405889)]] (duration: 44m 11s) [12:07:30] T405605: Suggested investigations: Sorting by status doesn't always work - https://phabricator.wikimedia.org/T405605 [12:07:31] T405889: Disable external permissions check in UserInfoCard - https://phabricator.wikimedia.org/T405889 [12:09:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:50] (03PS1) 10Btullis: Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) [12:11:16] (03CR) 10CI reject: [V:04-1] 
Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [12:12:43] (03CR) 10JMeybohm: [C:03+1] taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [12:13:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:15:52] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11224409 (10Jelto) [12:18:17] (03PS11) 10Arnaudb: gerrit: bugfixes on failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) [12:18:17] (03CR) 10Arnaudb: [C:03+2] "wording fixed" [cookbooks] - 10https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:19:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:27:23] (03Merged) 10jenkins-bot: gerrit: bugfixes on failover 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:28:01] FIRING: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:22] (03PS2) 10Jelto: Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) [12:37:33] (03CR) 10Jelto: Update eqiad to kubernetes 1.31, calico 3.29 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [12:40:00] (03PS1) 10Brouberol: Define the kafka-mirrromaker kubeconfigs in dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373) [12:40:02] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye [12:40:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [12:41:00] (03PS2) 10Jelto: Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) [12:41:35] (03PS1) 10Brouberol: kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) [12:41:57] (03CR) 10Jelto: Update eqiad to k8s 1.31 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [12:44:20] (03PS1) 10Btullis: Add new dummy keytabs for an-launcher1003 [labs/private] - 10https://gerrit.wikimedia.org/r/1192120 
(https://phabricator.wikimedia.org/T402943) [12:45:05] (03CR) 10Btullis: [V:03+2 C:03+2] Add new dummy keytabs for an-launcher1003 [labs/private] - 10https://gerrit.wikimedia.org/r/1192120 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [12:45:39] (03PS2) 10Brouberol: kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) [12:46:05] (03PS8) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:47:02] (03PS3) 10Brouberol: kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) [12:47:02] (03PS9) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:47:43] (03CR) 10CI reject: [V:04-1] kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [12:48:31] (03PS2) 10Btullis: Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) [12:49:54] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1236.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:50:02] (03PS10) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [12:50:05] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7086/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192107 
(https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [12:51:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1236.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:54:16] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1235.eqiad.wmnet with reason: host reimage [12:56:09] (03PS9) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) [12:56:12] (03PS4) 10Brouberol: kafka-mirrormaker: initial scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192109 (https://phabricator.wikimedia.org/T304373) [12:57:55] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye [12:59:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1235.eqiad.wmnet with reason: host reimage [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1300). [13:00:05] MatmaRex, xSavitar, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:17] o/ [13:00:20] I can deploy! [13:00:21] hi [13:00:25] thanks Lucas_WMDE :) [13:00:30] Lucas_WMDE okay sir 🙏🏽, thanks [13:00:41] xSavitar: do you want to self-service your deployment? 
[13:00:57] oh, you've improved my backport note, thanks [13:00:58] Lucas_WMDE, you can deploy it, I'll test :) [13:01:02] ok [13:01:05] let’s start with that [13:01:10] and run gate-and-submit for the backport in the meantime [13:01:11] Okay [13:01:13] MatmaRex: yeah :) [13:01:21] thought it’d be useful in case someone else ended up running it ^^ [13:01:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:03:01] RESOLVED: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:03:26] feel like that config change is taking longer than usual to merge o_O [13:03:31] what is tox doing https://integration.wikimedia.org/ci/job/operations-mw-config-tox/9024/console [13:03:31] (03Merged) 10jenkins-bot: session: Enable MultiBackendSessionStore on `group0` wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:03:40] Lucas_WMDE maybe you spoke too soon? [13:03:51] :) [13:03:52] not really, that was still longer than I would expect ^^ [13:03:53] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1183132|session: Enable MultiBackendSessionStore on `group0` wikis (T402808)]] [13:04:00] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:04:37] e.g. 
on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1191514 that build took 39s instead of 1m44s [13:04:47] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:05:00] Hm! Interesting... looking... [13:06:15] Lucas_WMDE not sure what is the issue but if it persists maybe we can file a Phab task or ask around if something has changed recently (since last week)? [13:06:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] FixRenameUserLocalLogs: Improve matching for users renamed multiple times (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:07:46] xSavitar: I think it’s fine to just leave it for now [13:07:57] Could it also be because this is the first deploy for the week? :) [13:08:00] the other recent builds at https://integration.wikimedia.org/ci/job/operations-mw-config-tox/ were all faster [13:08:38] * xSavitar will hunt down first deploys for the week and see if there are any clues. [13:09:19] Lucas_WMDE, Ack! I just looked at random patches last week and they seem to run for about 40s max [13:09:43] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1183132|session: Enable MultiBackendSessionStore on `group0` wikis (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:49] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:10:03] Should I test? [13:10:35] yes please :) [13:10:43] Ack! Testing now... 
[13:12:10] (03CR) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve matching for users renamed multiple times (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:13:12] (03CR) 10Ssingh: "Looks good but can also be removed from modules/profile/data/profile/installserver/preseed.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [13:13:42] (03CR) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve matching for users renamed multiple times (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:13:48] (03CR) 10Lucas Werkmeister (WMDE): "retracting +2 in case we want to change something" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:14:19] Tested so far on mediawikiwiki, officewiki and testwikidatawiki and everything seems to work fine. [13:14:23] Lucas_WMDE, you can sync, thank you [13:14:37] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Continuing with sync [13:14:38] thanks! [13:14:40] Lucas_WMDE: i guess we can backport the followup too, thanks for spotting that [13:14:45] ok, sounds good [13:15:15] MatmaRex: do you have someone around who can CR+2 the follow-up on master? [13:15:20] or should I be brave and do it? ^^ [13:15:24] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:15:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:15:44] xSavitar, possibly ;) can you have a look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1192128 ? [13:16:16] MatmaRex looking... 
[13:17:53] Responding to Lucas' comment otherwise looks fine. [13:18:02] *Responded [13:18:24] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:18:44] sure, done [13:18:46] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:21:07] I quickly added Bug: T398177 to the commit message before xSavitar +2s it ;) [13:21:08] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:21:25] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1236.eqiad.wmnet with OS bullseye [13:21:39] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Ensure field subquery returns just 1 result [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192130 (https://phabricator.wikimedia.org/T398177) [13:21:39] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:21:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1183132|session: Enable MultiBackendSessionStore on `group0` wikis (T402808)]] (duration: 17m 52s) [13:21:51] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:22:02] Lucas_WMDE, Ack! [13:22:03] thanks. 
and that's the backport [13:22:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:22:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1235.eqiad.wmnet with OS bullseye [13:23:22] oh, and I guess we should backport that second one to wmf.21 also [13:23:23] wait [13:23:29] no. hasn’t been branched yet ^^ [13:23:47] Lucas_WMDE, right, it hasn't been branched yet: https://versions.toolforge.org/ [13:23:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:23:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192130 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:24:07] I suspect the first one will merge very quickly thanks to the CI success result cache [13:24:15] the second one will still need a full gate-and-submit though [13:25:29] Lucas_WMDE, wait, so you mean when gate-and-submit runs, the results get cached and even after it gets interrupted, then retriggered, it doesn't do a full run? Nice. What if the patch changed in the meantime? 
[13:25:53] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Improve matching for users renamed multiple times [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:25:59] ^^ [13:26:09] the cache is only used if it’s the same Git commit being tested [13:26:23] both in the repo to which the test belongs and also in all dependent repositories, I believe [13:26:33] Nice, that's a neat feature. Kudos to the CI/CD lords around here. [13:26:35] so no the master branch it’s quite rare to see a cache hit afaik [13:26:39] *on the master branch [13:26:50] because by the time you try the gate-and-submit again, something else probably got merged already [13:26:51] Ack [13:26:53] but it’s useful for backport branches [13:26:56] https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/KTP34HIR5D66QLGHC3ZAIZKQWE46O5F4/ was the announcement [13:27:27] thanks for the link. Will read. 
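Note on the CI success cache described above: the rule stated in the channel is that a cached result is reused only when the same Git commit is being tested, in the repository under test and (as Lucas_WMDE believes) in all dependent repositories. A minimal Python sketch of that rule follows. It is an illustration under assumptions, not the actual CI implementation: `cache_key`, `SuccessCache`, the job name string, and the short commit IDs are all hypothetical.

```python
import hashlib


def cache_key(job, commits):
    """Build a key from the job name and every repo's checked-out commit.

    `commits` maps repository name -> Git commit SHA (hypothetical input shape).
    Sorting makes the key independent of dictionary ordering.
    """
    parts = [job] + [f"{repo}@{sha}" for repo, sha in sorted(commits.items())]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


class SuccessCache:
    """Remembers which (job, commits) combinations have already passed."""

    def __init__(self):
        self._passed = set()

    def record_success(self, job, commits):
        self._passed.add(cache_key(job, commits))

    def can_skip(self, job, commits):
        # A hit requires the identical commit in every repository involved.
        return cache_key(job, commits) in self._passed


# Hypothetical example: a backport branch whose dependencies did not move.
cache = SuccessCache()
first_run = {"extensions/CentralAuth": "aaa111", "core": "bbb222"}
cache.record_success("gate-and-submit", first_run)

retriggered = dict(first_run)                     # same commits everywhere
amended = {**first_run, "extensions/CentralAuth": "ccc333"}  # patch changed
dep_moved = {**first_run, "core": "ddd444"}       # something else got merged

hit_same = cache.can_skip("gate-and-submit", retriggered)
hit_amended = cache.can_skip("gate-and-submit", amended)
hit_dep_moved = cache.can_skip("gate-and-submit", dep_moved)
```

In this sketch the `dep_moved` case is the one that makes hits rare on master, since another merge usually lands before a retry, while a backport branch tends to stay still long enough for `retriggered` to match.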
[13:27:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye [13:29:02] I’ll do my deployment-charts change in parallel, should have no effect on each other [13:29:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720) (owner: 10Lucas Werkmeister (WMDE)) [13:30:49] (03PS10) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [13:31:11] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720) (owner: 10Lucas Werkmeister (WMDE)) [13:31:13] i need to step away for a bit, i'll be back in 15 minutes or so [13:31:19] ok [13:33:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:33:55] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:34:28] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:34:37] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [13:35:05] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [13:35:13] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Ensure field subquery returns just 1 result [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192130 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:35:37] !log 
lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1191495|FixRenameUserLocalLogs: Improve matching for users renamed multiple times (T398177)]], [[gerrit:1192130|FixRenameUserLocalLogs: Ensure field subquery returns just 1 result (T398177)]] [13:35:40] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [13:35:44] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:36:01] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [13:36:41] xSavitar, while MatmaRex is afk: there’s nothing to test on mwdebug for those two backports, right? [13:36:45] since they only affect a maintenance script [13:38:25] Yes sir! [13:38:48] MatmaRex plans to do a dry run, investigate the output and then kicks the script again afterwards [13:38:58] So, I think you can sync [13:40:02] yeah, I’ll just do the dry run after the sync is done [13:40:11] Okay [13:40:35] * Lucas_WMDE is done with the deployment-charts kubernetes deploy ftr [13:41:36] !log lucaswerkmeister-wmde@deploy2002 matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1191495|FixRenameUserLocalLogs: Improve matching for users renamed multiple times (T398177)]], [[gerrit:1192130|FixRenameUserLocalLogs: Ensure field subquery returns just 1 result (T398177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:42] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:42:00] !log lucaswerkmeister-wmde@deploy2002 matmarex, lucaswerkmeister-wmde: Continuing with sync [13:42:39] btullis@cumin1003 reimage (PID 502318) is awaiting input [13:43:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:44:51] (03PS3) 10Btullis: Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) [13:46:09] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7087/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [13:46:48] (03CR) 10Brouberol: [C:03+1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:47:01] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191495|FixRenameUserLocalLogs: Improve matching for users renamed multiple times (T398177)]], [[gerrit:1192130|FixRenameUserLocalLogs: Ensure field subquery returns just 1 result (T398177)]] (duration: 11m 24s) [13:47:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 
13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11224627 (10MatthewVernon) Hi @VRiley-WMF do you think you'll be able to do these swaps this week, please? [13:47:11] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:48:03] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist sul CentralAuth:FixRenameUserLocalLogs --logwiki=metawiki # T398177 (dry run) [13:48:22] !log UTC afternoon backport+config window done (CentralAuth:FixRenameUserLocalLogs maintenance script will keep running for a few hours) [13:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:34] awight: ^ if you wanted to deploy something [13:48:38] (03CR) 10Btullis: [C:04-1] "Do we need to vendor these modules in the operator chart at all?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:48:42] Lucas_WMDE, thank you very much for deploying. [13:49:36] yw :) [13:51:08] Lucas_WMDE: are you doing the deployment-charts now, and if so may I sneak in a few minutes of mw maintenance script run? [13:51:57] I already did the deployment-charts [13:52:04] I’m running a maintenance script but I assume you can run another one [13:52:14] (the one for T398177 will take some more hours) [13:52:15] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:53:04] Lucas_WMDE: okay I'll go ahead and try that. Just let me know if it seems to cause problems. I think this will take 10-60s, to purge 500 or so pages. 
[13:54:03] sounds good [13:55:03] * MatmaRex back [13:55:29] thanks Lucas_WMDE [13:55:57] np [13:57:22] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ms-be[1086-1088].eqiad.wmnet with reason: awaiting controller swap [13:57:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11224717 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f05e1660-c13c-4689-a96d-eaccf6967088) set by mvernon@cu... [13:58:08] !log awight@deploy2002 mwscript-k8s job started: purgePage.php --wiki=dewiki # T389363 [13:58:14] T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363 [13:59:06] Lucas_WMDE: All done, good luck with your longer run! [14:00:25] thanks! [14:02:55] 06SRE, 06Traffic: "Backend fetch failed" on edit save - https://phabricator.wikimedia.org/T382790#11224731 (10ssingh) 05Open→03Resolved a:03ssingh This has been open for a while and there hasn't been any follow up from either side. @MGChecker: Please re-open if this issue still persists for you. Thanks! 
[14:05:55] (03CR) 10Tiziano Fogli: [C:03+2] mirrormaker: add alerts directly in Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [14:06:46] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1236.eqiad.wmnet with OS bullseye [14:07:14] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye [14:07:18] MatmaRex: FWIW, so far I think I’m seeing the same number of “User has existed, but no local log entry” output rows [14:07:33] but the three “More than one matching local log entry for global” ones from abwiki went away [14:07:46] actually, the very last “User has existed, but no local log entry for global #49887933” line on abwiki went away too [14:07:53] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917 (10Maria_Lechner_WMDE) 03NEW [14:08:11] nice [14:09:05] (03PS18) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [14:09:20] 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#11224783 (10ssingh) There has been no follow-up on this after Jun 2024. @MatthewVernon: should we keep this open? 
[14:09:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405843#11224785 (10phaultfinder) [14:10:29] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:10:37] (03PS1) 10Marco Fossati: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) [14:11:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [14:11:58] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#11224790 (10ssingh) 05Open→03Resolved a:03ssingh We have made progress in T301605, and specific to this task, we ramped up tra... [14:13:40] there’s still some “More than one matching local log entry for global” though, amwiki has two [14:14:17] (03CR) 10Joal: Replace old sqoop wiki list file with new autoupdated file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu) [14:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:14:52] MatmaRex: “amwiki Would update performer for local #101754 based on global #59763297 from 'Nahomnata' to 'J ansari'” [14:15:03] does that sound right? 
I thought the second part of https://phabricator.wikimedia.org/T398177#11146083 meant these shouldn’t have happened 🤔 [14:15:17] (feel free to wait with the answer until it’s done and I’ve posted the full logs, of course :P) [14:15:35] (there’s one other “would update” in amwiki, from ) [14:16:01] in a meeting right now, i'll look later [14:17:03] (03CR) 10Phuedx: "@kharlan@wikimedia.org: Yes. I'm a little unsure as to why you're doing this here rather than in the ConfirmEdit extension but I'm out of " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [14:18:48] (03CR) 10Reedy: "Avoiding putting WMF specific stuff (assumptions etc) into a bundled/tarballed extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [14:18:52] ok :) [14:19:52] (03CR) 10Reedy: WIP hCaptcha: Enable A/B test for frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [14:21:13] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1236.eqiad.wmnet with reason: host reimage [14:21:36] (03CR) 10Kosta Harlan: "Yes, what @reedy@wikimedia.org said -- this is WMF specific, so the logic belongs here (for lack of better place, see also T401939)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [14:21:48] (03PS1) 10Scott French: haproxy acl naming refactor and minor UI improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192145 [14:23:33] !log cgoubert@cumin1003 START - Cookbook sre.discovery.service-route check toolhub: maintenance [14:23:33] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check toolhub: maintenance [14:23:59] (03PS3) 10Kosta Harlan: hCaptcha: Enable 
A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) [14:24:11] (03PS6) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) [14:24:11] (03PS1) 10Stevemunene: remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801) [14:24:17] (03PS1) 10Kosta Harlan: Hooks: Enable overriding the hook instance per action [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192148 (https://phabricator.wikimedia.org/T405239) [14:24:23] (03CR) 10CI reject: [V:04-1] hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [14:24:38] (03CR) 10Ssingh: [C:03+1] druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [14:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:24:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1236.eqiad.wmnet with reason: host reimage [14:25:07] (03CR) 10Ssingh: [C:03+1] "Actually sorry, my bad. site.pp can be updated as well if you want but +1." [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [14:26:03] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11224867 (10WMDE-leszek) I approve this request on WMDE's end. Thank you! 
[14:26:11] (03CR) 10Scott French: [V:03+2] "Tested locally at 5a6390f" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192145 (owner: 10Scott French) [14:26:18] (03CR) 10Stevemunene: "No worries, I added another patch for that I3bca4291156dec81bb03d37eb66d1a9a5aa3cab4." [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [14:27:02] (03CR) 10Ssingh: [C:03+1] remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [14:28:25] (03PS2) 10Brouberol: Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1430) [14:30:53] (03CR) 10Scott French: [V:03+2 C:03+2] haproxy acl naming refactor and minor UI improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192145 (owner: 10Scott French) [14:31:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192148 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [14:33:37] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002" [14:33:39] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002 [14:34:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code 
(exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002 [14:34:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002" [14:34:35] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11224891 (10Jclark-ctr) @VRiley-WMF T404103 optics have arrived and are or cart between rows C/D. Please connect all the cables you preran and update CableIDs in Netbox https://netbox.wiki... [14:35:41] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11224893 (10Jclark-ctr) @cmooney all fibers for ssw1-d1-eqiad have been connected except cr1-eqiad ,ssw1-e1-eqiad ,ssw1-f1-eqiad [14:38:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno) [14:39:16] 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#11224904 (10MatthewVernon) I guess not, if it has recurred, it's not been enough to page... [14:39:58] (03PS19) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [14:40:11] 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#11224908 (10ssingh) 05Open→03Resolved a:03ssingh OK thank you. I am marking this as resolved for now. We can re-open as required. 
[14:41:24] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:43:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11224919 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [14:45:06] (03PS11) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:45:51] (03PS20) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [14:45:55] (03PS12) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [14:46:59] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [14:47:32] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:48:39] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11224950 (10Jelto) [14:49:10] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#11224955 (10Jelto) [14:50:03] btullis@cumin1003 reimage (PID 505620) is awaiting input [14:51:13] (03CR) 10CDanis: [C:03+1] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [14:51:32] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11224972 (10elukey) p:05Triage→03Medium [14:51:53] (03CR) 10CDanis: [C:03+1] admin/data: add the analytics-wikidata system user and user groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [14:52:11] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11224976 (10elukey) p:05Triage→03Low [14:52:24] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11224978 (10elukey) [14:55:43] 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11224991 (10JTweed-WMF) [14:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:55] RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1245) taken on 2025-09-29 13:24:43 (1904 GiB, +2.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:02:11] !log stevemunene@puppetserver1001 conftool action : set/pooled=no; selector: 
service=(druid-public-broker),name=druid1007.eqiad.wmnet [15:02:30] !log stevemunene@puppetserver1001 conftool action : set/pooled=no; selector: service=(druid-public-broker),name=druid1008.eqiad.wmnet [15:03:29] (03CR) 10Btullis: [C:03+1] Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:03:42] (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:04:34] 06SRE, 06serviceops, 06Traffic: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800#11225029 (10ssingh) This is being moved on the Traffic workboard to "Radar/Not for service" as I don't think there is anything on our end to do here. Please let me know if you... [15:04:42] (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: initial scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192109 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:04:54] 06SRE, 06serviceops, 06Traffic: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800#11225030 (10ssingh) And to be clear, by that I mean that this change is better suited for MW and not the CDN. 
[15:06:16] (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:07:53] (03CR) 10Btullis: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:08:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11225041 (10Jhancock.wm) Update on the cp2056. Finally got Dell to agree to send a replacement card after a week of back and forth and escalations. So that shoul... [15:08:39] PROBLEM - Druid broker on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:08:39] PROBLEM - Druid coordinator on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:08:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:08:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:08:41] PROBLEM - Druid overlord on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:09:05] (03CR) 10Stevemunene: [C:03+2] druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 
10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [15:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:16] ^ stevemunene is removing druid1007-8 [15:09:17] so this might be that [15:10:21] (03CR) 10Brouberol: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:10:39] PROBLEM - Druid overlord on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:10:39] PROBLEM - Druid historical on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:10:39] PROBLEM - Druid broker on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:10:39] PROBLEM - Druid coordinator on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:10:41] PROBLEM - Druid middlemanager on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:11:34] Druid errors are from Decommissioning druid services on druid100[7-8] for T403801 [15:11:35] T403801: decommission druid100[7-8].eqiad.wmnet - 
https://phabricator.wikimedia.org/T403801 [15:13:03] (03CR) 10Ahmon Dancy: "The addition of the profile::puppetserver::volatile::cdn_private_git_token lookup has broken puppet on deployment-puppetserver-1.deploymen" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [15:14:32] (03PS13) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) [15:14:35] (03CR) 10Brouberol: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:16:30] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11225074 (10dancy) Noting that puppet on `deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud` is currently broken due to the addition of the `profile::puppetserver::volatile::cdn_private_git_token` look... [15:17:02] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 10Spicerack: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11225075 (10Scott_French) 05Open→03Resolved a:03Scott_French Amazing - thank you very much,... 
[15:19:35] (03CR) 10Brouberol: [C:03+2] Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol) [15:22:12] (03CR) 10Btullis: [C:03+1] airflow: automatically figure out some values to reduce release config size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol) [15:24:31] !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values [15:24:33] !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s) [15:24:39] RECOVERY - Druid coordinator on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:24:39] RECOVERY - Druid broker on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:24:39] RECOVERY - Druid historical on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:24:39] RECOVERY - Druid overlord on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:24:41] RECOVERY - Druid middlemanager on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:26:01] !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values [15:26:04] !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated 
mw_context values (duration: 00m 02s) [15:26:40] !log tappof@deploy2002 Started restart [performance/navtiming@94fa387]: Add authenticated mw_context values [15:27:19] (03PS21) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [15:28:35] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:29:12] 06SRE, 06Traffic, 13Patch-For-Review, 07Wikimedia-Performance-recommendation: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911#11225132 (10ssingh) What is the update on this, given that it has been a while and I am a bit confused reading the text and trying to... [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1530). [15:33:35] !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values [15:33:37] !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s) [15:34:03] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11225146 (10elukey) We have configured Tegola and Kartotherian in prod-codfw to use the new postgres stack, but I am seeing some errors in Kartotherian like the following: ` {"name":... 
[15:34:51] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:39] RECOVERY - Druid coordinator on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:35:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:35:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:35:39] RECOVERY - Druid broker on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:35:41] RECOVERY - Druid overlord on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:36:17] !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values [15:36:19] !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s) [15:36:45] 06SRE, 06Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431#11225153 (10ssingh) 05Open→03Resolved a:03ssingh We have had this for a while and the responses are padded. Marking as resolved. 
[15:37:54] (03PS22) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [15:38:21] 06SRE, 06Traffic: Performance implications of buffer sizes in Apache Traffic Server intercept plugins - https://phabricator.wikimedia.org/T287847#11225162 (10ssingh) 05Open→03Resolved a:03ssingh This was merged upstream in 9.2.x so we have inherited this change. Since we have not revisited this since... [15:38:30] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11225165 (10elukey) I applied `/usr/local/bin/maps-grants-gis.sql` on maps2011 and now the grants are better: ` gis=# SELECT * FROM information_schema.role_table_grants where grantee... [15:39:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:17] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:39:28] (03PS23) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [15:40:04] !log dancy@deploy2002 Installing scap version "4.211.0" for 168 host(s) [15:40:45] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:41:59] !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:42:02] !log tappof@deploy2002 Finished deploy 
[performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s) [15:43:02] !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:43:05] !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s) [15:44:03] !log tappof@deploy2002 Started restart [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:44:08] !log dancy@deploy2002 Installation of scap version "4.211.0" completed for 168 hosts [15:44:10] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 07Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#11225214 (10ssingh) I am curious: should we keep this open or should this be resolved now given that we have... [15:44:49] (03CR) 10Ssingh: [C:03+1] "@cdanis@wikimedia.org: I guess we should merge this today; let me know and happy to take care of that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191703 (owner: 10CDanis) [15:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:44:56] (03CR) 10Ssingh: [C:03+1] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: 10CDanis) [15:45:00] !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:45:02] !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s) [15:45:21] !log dancy@deploy2002 Started scap sync-world: Testing gitinfo fix (T405738) [15:45:27] T405738: Debug scap partial deployment, 25 Sept 2025 - https://phabricator.wikimedia.org/T405738 [15:46:31] !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:46:41] !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 15s) [15:48:02] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11225236 (10ABran-WMF) a:03ABran-WMF [15:49:13] (03CR) 10Elukey: [C:03+1] "Very sad that nowadays we see these huge amount of yaml, but we cannot really do anything differently :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [15:51:28] !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:51:30] !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s) [15:51:37] 10ops-eqiad, 06SRE, 
10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225259 (10VRiley-WMF) Hey @MatthewVernon Yes, I am planning on doing this today. I apologize as I was out for two days last week. [15:52:26] !log tappof@deploy2002 Started restart [performance/navtiming@578b1d3]: Add authenticated mw_context values [15:52:29] sukhe: please feel free to merge both of those patches if you like :D [15:52:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11225279 (10Papaul) [15:52:33] I can also get around to it soon, in a meeting now [15:53:14] cdanis: happy to take care of them [15:53:33] (03CR) 10Ssingh: [C:03+2] taskgen: add haproxy Lua tests [puppet] - 10https://gerrit.wikimedia.org/r/1191703 (owner: 10CDanis) [15:54:24] !log restart haproxy on cp5021 to test utf8ps converter [15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:42] (03CR) 10Ssingh: [C:03+2] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: 10CDanis) [15:55:27] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5021.eqsin.wmnet [15:56:23] (03CR) 10Ssingh: "Thanks for reporting the broken CI, unrelated to this change. @cdanis@wikimedia.org fixed this and that change has been merged. 
Can you pl" [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [15:56:37] !log dancy@deploy2002 Finished scap sync-world: Testing gitinfo fix (T405738) (duration: 11m 16s) [15:56:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:56:44] T405738: Debug scap partial deployment, 25 Sept 2025 - https://phabricator.wikimedia.org/T405738 [15:56:57] (03PS2) 10Jelto: Update eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) [15:57:10] (03PS3) 10Jelto: Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) [15:57:27] (03PS2) 10Clément Goubert: taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) [15:57:38] (03PS5) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [15:57:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:58:31] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet [15:59:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225331 (10MatthewVernon) Cool, thanks :) [16:00:42] (03CR) 10Clément Goubert: "Thank you both for fixing it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [16:00:46] (03CR) 10Elukey: [C:03+1] "I didn't find anything strange, it was a lot of yaml but I didn't find:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [16:02:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:46] (03PS4) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) [16:04:24] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11225371 (10elukey) @Jhancock.wm me and Jesse are running out of ideas, if you have time could you please open the host and check if the bus between the BMC and the motherboard et... [16:05:18] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [16:05:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet wi... 
[16:06:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:06:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:07:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11225393 (10Papaul) I had a meeting today with @Jgreen about the new switch configuration. what we will be doing is to move the... [16:07:20] !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048'] [16:07:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048'] [16:09:21] !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048'] [16:09:37] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048'] [16:09:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:13] (03PS24) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [16:11:25] !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048'] [16:11:37] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048'] [16:12:20] (03PS25) 
10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [16:13:44] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:13:50] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:14:00] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:14:26] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:14:37] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:16:13] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405843#11225482 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are 6 servers on the EOL list in this rack. removing thresholds and adding to tracking task [16:18:47] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405755#11225505 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are at least 6 servers in this rack that are on the EOL list. Removing alerting and addin... 
[16:20:00] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11225508 (10Jhancock.wm) [16:23:47] RESOLVED: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:13] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11225571 (10MatthewVernon) @elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* node... [16:25:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11225573 (10elukey) The firmware cookbook doesn't work yet since spicerack is configured to look for a `HttpPushUri` field in the Redfish's UpdateService endpoi... [16:28:07] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11225583 (10elukey) >>! In T404356#11225571, @MatthewVernon wrote: > @elukey re the triage priority - if there's a problem with our s... 
[16:28:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:29:04] 10ops-eqsin: Inbound errors on interface cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://phabricator.wikimedia.org/T405938 (10phaultfinder) 03NEW [16:29:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:35:38] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940 (10RobH) 03NEW [16:36:15] (03PS26) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [16:36:38] (03PS1) 10Papaul: Add new Frack switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1192171 (https://phabricator.wikimedia.org/T405618) [16:36:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:36:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:36:46] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11225644 (10RobH) a:03LSobanski @lsobanski, I'm not exactly sure who in your team should be the point of contact for the migration of these hosts (list... 
[16:37:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 (10RobH) 03NEW [16:38:17] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:39:02] (03PS2) 10Papaul: Add new Frack switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1192171 (https://phabricator.wikimedia.org/T405618) [16:39:45] (03CR) 10Harroyo-wmf: [C:03+1] hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [16:39:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11225664 (10RobH) a:03KOfori @kofori, I'm assigning this to you as team manager for feedback on who I should work with as the point of contact for the migration of... [16:41:06] (03PS2) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [16:41:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11225673 (10ssingh) Note that @KOfori is out, this should be directed to @Kappakayala in the meantime. 
[16:41:18] (03CR) 10Papaul: [C:03+2] Add new Frack switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1192171 (https://phabricator.wikimedia.org/T405618) (owner: 10Papaul) [16:41:46] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943 (10RobH) 03NEW [16:42:19] (03PS3) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [16:42:30] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:42:49] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:44:34] (03PS4) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [16:45:36] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:46:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945 (10RobH) 03NEW [16:46:43] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:48:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11225731 (10RobH) a:05RobH→03joanna_borun Joanna, I'm not exactly sure who on your team to assign this as point of contact, so I'm assigning... 
[16:49:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946 (10RobH) 03NEW [16:49:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11225747 (10CDanis) a:05joanna_borun→03LSobanski [16:51:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11225769 (10RobH) a:03herron @herron or @colewhite (not sure which of you is best to handle this, please reassign as needed!) I'm looking to get some feedback for the s... [16:51:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11225772 (10RobH) [16:52:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948 (10RobH) 03NEW [16:53:14] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [16:53:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225795 (10RobH) a:05RobH→03Gehel @gehel, I'm not sure who would be the best point of contact within Search SRE to coordinate with for the migration of the above... 
[16:56:02] (03CR) 10Bearloga: hCaptcha: Enable A/B test for frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [16:57:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:58:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:59:45] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11225841 (10RobH) a:03BTullis @btullis, After asking Guillaume he said I should work with you as point of contact for these migrations (though that you would still b... [17:00:04] swfrench-wmf and dancy: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1700). Please do the needful. [17:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1700). [17:00:16] o/ [17:00:41] o/ [17:00:44] swfrench-wmf: The new release of scap is ready to deploy [17:01:09] great, I'm running some pre-flight checks to make sure there aren't any latent diffs that will get in our way [17:01:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225848 (10RobH) a:05Gehel→03bking After irc chat with @gehel he suggested this should assign over to @bking for coordination (but it will still be discussed with... 
[17:01:18] 06SRE, 13Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460#11225852 (10Krinkle) [17:02:10] dancy: looks like we're good to go. feel free to go ahead and deploy scap, then we can run a stop-before-sync deploy as discussed. [17:02:43] OK. [17:03:06] !log dancy@deploy2002 Installing scap version "4.212.0" for 2 host(s) [17:03:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 (10RobH) 03NEW [17:04:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225872 (10RobH) [17:04:55] !log dancy@deploy2002 Installation of scap version "4.212.0" completed for 2 hosts [17:05:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:05:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:05:56] !log dancy@deploy2002 Started scap sync-world: Testing T405110 [17:06:03] T405110: Allow the same namespace name to be used in different clusters - https://phabricator.wikimedia.org/T405110 [17:06:43] (03CR) 10Dr0ptp4kt: [C:03+1] ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [17:06:44] !log dancy@deploy2002 Stopping before sync operations [17:07:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11225895 (10RobH) @Kappakayala, I'm not exactly sure who in your team would be the best point of contact for 
the above migration list, as it covers multiple service groups. Th... [17:07:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11225896 (10RobH) a:03Kappakayala [17:08:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225899 (10Jhancock.wm) @Papaul Hey we've gotten the pressed and site.pp files configured correctly as far as i can tell but still getting this o... [17:08:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225900 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [17:08:29] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225901 (10RobH) [17:09:48] dancy: I'm looking at the updated contents of `/etc/helmfile-defaults/mediawiki/release` and I think this looks good [17:09:57] Agreed [17:10:22] alright, I'll merge your helmfile patch, and then run the diffs again [17:10:30] OK.
Standing by [17:10:35] (03CR) 10Scott French: [C:03+2] mediawiki services: Update path to scap-created yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189905 (https://phabricator.wikimedia.org/T405110) (owner: 10Ahmon Dancy) [17:10:45] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:13:16] (03Merged) 10jenkins-bot: mediawiki services: Update path to scap-created yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189905 (https://phabricator.wikimedia.org/T405110) (owner: 10Ahmon Dancy) [17:13:38] (03CR) 10Scott French: [C:03+2] deployment_server: support environment in release values file name [puppet] - 10https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110) (owner: 10Scott French) [17:14:36] papaul: good to merge your hieradata changes? [17:14:39] (03PS1) 10Dzahn: zuul: move new zuul nodepool setup to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) [17:15:07] (03CR) 10CI reject: [V:04-1] zuul: move new zuul nodepool setup to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:15:40] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11225919 (10Jclark-ctr) Your dispatch shipped on 9/29/2025 11:56 AM [17:16:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225920 (10VRiley-WMF) 05Open→03In progress Starting work on ms-be1087 (will get to ms-be1086 in a bit. starting with the cage... 
[17:16:36] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:26] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [17:17:32] (03PS2) 10Dzahn: zuul: move new zuul nodepool setup to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) [17:17:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225925 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with O... [17:18:58] !log dancy@deploy2002 Started scap sync-world: Testing T405110 (v2) [17:19:05] T405110: Allow the same namespace name to be used in different clusters - https://phabricator.wikimedia.org/T405110 [17:26:18] !log dancy@deploy2002 Finished scap sync-world: Testing T405110 (v2) (duration: 07m 20s) [17:26:25] T405110: Allow the same namespace name to be used in different clusters - https://phabricator.wikimedia.org/T405110 [17:27:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:28:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:28:56] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11225940 (10Maria_Lechner_WMDE) I have not signed an NDA yet, I'm happy to receive the respective form/doc at maria.lechner AT wikimedia DOT de. 
[17:33:36] (03CR) 10Eric Gardner: [C:03+1] ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [17:34:02] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:35:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:35:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:36:44] dancy: alright, mwscript-k8s works as expected - I think we're done here :) [17:36:59] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:37:03] woohoo! Thanks for testing swfrench-wmf> [17:37:28] thanks for making this happen! 
:) [17:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225981 (10VRiley-WMF) [17:42:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225983 (10VRiley-WMF) Finished updating ms-be1087, moving onto ms-be1088 [17:48:03] (03PS1) 10Brouberol: opensearch-operator-crds: add a crds.yaml fixture file to point the CI to the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192185 (https://phabricator.wikimedia.org/T397246) [17:49:08] (03PS11) 10Krinkle: varnish: Enable unified mobile routing on wikimedia.org wikis [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [17:49:38] (03CR) 10Superpes15: "Ack! 
many thanks :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [17:49:49] (03CR) 10Btullis: [C:03+1] opensearch-operator-crds: add a crds.yaml fixture file to point the CI to the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192185 (https://phabricator.wikimedia.org/T397246) (owner: 10Brouberol) [17:50:45] (03PS17) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [17:51:57] (03CR) 10Bearloga: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [17:53:42] (03CR) 10Bking: [C:03+2] opensearch-operator-crds: add a crds.yaml fixture file to point the CI to the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192185 (https://phabricator.wikimedia.org/T397246) (owner: 10Brouberol) [17:54:35] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [17:55:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:56:28] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [17:56:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:00:02] (03CR) 10Bking: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 
(https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [18:00:26] (03PS27) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [18:02:05] (03CR) 10Brouberol: [C:03+2] airflow: automatically figure out some values to reduce release config size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol) [18:02:09] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11226050 (10Prototyperspective) Videos used to start quickly and to load quickly. Since a short while they aren't anymore. Maybe I should ask on a Commons board whether other users also have... [18:04:29] (03Merged) 10jenkins-bot: airflow: automatically figure out some values to reduce release config size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol) [18:05:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:05:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:08:27] (03CR) 10BCornwall: [V:03+1 C:03+2] "Tests are all passing except for an unexpected broken test introduced in I8553991e419f604585d812db2ce66c9a05a4e764" [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [18:11:56] (03CR) 10Brouberol: opensearch-cluster: Add chart for review (3/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) 
[18:12:14] (03PS2) 10Cappybaraa: diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [18:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:21:15] (03CR) 10Brouberol: "Does a cluster expose prometheus metrics? Do we scrape them?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [18:22:20] (03CR) 10Aaron Schulz: "Interesting that this is opt-out. I get that these CSP headers are used restbase compatibility and perhaps some other non-MW endpoints tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) (owner: 10Hnowlan) [18:23:07] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [18:23:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11226087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [18:24:14] (03PS3) 10Andrea Denisse: mediawiki-engineering: Add API Gateway alerts with thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) [18:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:26:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server 
middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:28:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:32:34] (03PS9) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [18:33:02] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [18:34:42] (03CR) 10Dr0ptp4kt: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [18:35:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:35:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:38:58] (03PS1) 10Dzahn: zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) [18:39:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:39:19] jhancock@cumin1002 reimage (PID 4182690) is awaiting input [18:44:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) 
for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:47:10] (03PS1) 10Dzahn: move zuul nodepool to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) [18:47:24] (03CR) 10Dzahn: "needs https://gerrit.wikimedia.org/r/1192200 to compile" [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:47:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [18:48:02] (03PS2) 10Dzahn: move zuul nodepool to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) [18:48:08] (03CR) 10Dzahn: [C:03+2] move zuul nodepool to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:48:26] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [18:48:32] (03PS3) 10Dzahn: move zuul nodepool user token to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) [18:48:47] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1191504|Disable wmgUseMdotRouting on wikimedia.org wikis (group1) (T403510)]] [18:48:54] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [18:50:30] Lucas_WMDE: btw i finally looked at the log line you 
posted, [18:50:33] [29 Sep 25 16:14] * Lucas_WMDE MatmaRex: “amwiki Would update performer for local #101754 based on global #59763297 from 'Nahomnata' to 'J ansari'” [18:51:38] this is in fact correct – "Nahomnata" is not a renamer (and they have 0 edits), their account just by accident has the same ID on amwiki as "J ansari" has on metawiki [18:54:21] (actor ID) [18:55:23] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1191504|Disable wmgUseMdotRouting on wikimedia.org wikis (group1) (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:55:30] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [18:55:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:55:54] (03PS1) 10Elukey: role::maps::master: enable planet sync in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565) [18:56:16] (03CR) 10Dzahn: [V:03+2 C:03+2] move zuul nodepool user token to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:57:32] !log krinkle@deploy2002 krinkle: Continuing with sync [18:57:35] (03CR) 10Elukey: [V:03+1] "@mmuhlenhoff@wikimedia.org I am re-enabling osm import on maps2009 to allow it to catch up over night, we'd need a codfw stack in two days" [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [18:57:42] (03CR) 10Elukey: "@mmuhlenhoff@wikimedia.org I am re-enabling osm import on maps2009 to allow it to catch up over night, we'd need a codfw stack in two days" [puppet] - 10https://gerrit.wikimedia.org/r/1192205 
(https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [18:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:58:32] (03CR) 10Elukey: [C:03+2] role::maps::master: enable planet sync in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [18:58:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:59:25] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958 (10RobH) 03NEW [18:59:57] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11226273 (10RobH) a:03MatthewVernon @MatthewVernon, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.... [19:01:12] (03CR) 10Phuedx: "Understood. Many thanks for the explanation." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [19:01:51] (03CR) 10Eric Gardner: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [19:02:37] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191504|Disable wmgUseMdotRouting on wikimedia.org wikis (group1) (T403510)]] (duration: 13m 50s) [19:02:44] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [19:02:46] (03PS1) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 [19:05:11] (03PS1) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) [19:06:05] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7110/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:06:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1192179/7108/" [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:06:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:06:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:07:19] (03PS2) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA 
policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) [19:08:11] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7111/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:08:41] (03CR) 10Dzahn: "just a thought. since official MW releases come from releases.wikimedia.org you could also consider moving the GPG key there" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:09:56] (03CR) 10Ssingh: [V:03+1] "Yeah it's a good point and perhaps we should add that as well. Specifically in this case, the CR is a response to T405165." [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:10:21] (03PS1) 10Scott French: deployment_server: add mw-script/next tracking PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) [19:10:21] (03CR) 10Scott French: "Thanks in advance for the reviews, Reuven." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:10:22] (03PS1) 10Scott French: deployment_server: enable support for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192203 (https://phabricator.wikimedia.org/T405955) [19:11:07] (03PS3) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) [19:12:10] (03PS4) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) [19:13:04] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7112/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:20:21] (03PS1) 10Dzahn: zuul: follow-up fix to moving nodepool config to own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192215 (https://phabricator.wikimedia.org/T395938) [19:25:23] (03CR) 10BCornwall: "I agree with Daniel - fewer exceptions to handle!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh) [19:25:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:26:26] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964 (10RobH) 03NEW [19:26:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226439 (10VRiley-WMF) [19:26:58] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11226455 (10RobH) [19:27:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226456 (10VRiley-WMF) moving onto ms-be1086 [19:27:38] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11226459 (10RobH) a:03bking @bking, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and... 
[19:27:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:28:56] (03CR) 10Dzahn: [C:03+2] zuul: follow-up fix to moving nodepool config to own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192215 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:30:00] (03CR) 10RLazarus: [C:03+1] deployment_server: add mw-script/next tracking PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:30:08] (03CR) 10RLazarus: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1192203 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:30:45] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966 (10RobH) 03NEW [19:31:17] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11226497 (10RobH) a:03bking @bking, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and... 
[19:32:05] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11226505 (10RobH) [19:33:59] (03PS5) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) [19:34:05] (03CR) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [19:35:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:35:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:36:14] jouncebot: nowandnext [19:36:14] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [19:36:14] In 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2000) [19:37:34] unless there are any objections, I might merge a puppet patch shortly that requires a follow-on no-sync (i.e., no deploy) scap run [19:38:33] (03PS28) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [19:38:42] (03CR) 10Scott French: [C:03+2] deployment_server: add mw-script/next tracking PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:41:50] * swfrench-wmf is running puppet-agent [19:42:19] I'll be running scap in ~ 5 minutes [19:44:04] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - 
https://phabricator.wikimedia.org/T400661#11226602 (10Jhancock.wm) got the pxe issue fixed. but found a new one. @Clement_Goubert this server has to be uefi and it looks like the preseed is set up for bios. if i'm reading... [19:44:13] (03PS2) 10Dzahn: zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) [19:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:48:32] !log swfrench@deploy2002 Started scap sync-world: Non-deploy scap run to initialize mw-script/next helmfile-defaults values - T405955 [19:48:41] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [19:48:48] !log swfrench@deploy2002 Stopping before sync operations [19:49:48] * swfrench-wmf is done [19:50:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226618 (10VRiley-WMF) 05In progress→03Open [19:51:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226637 (10VRiley-WMF) These are all done! will await for the next two. Thanks @MatthewVernon [19:51:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226648 (10VRiley-WMF) These are all done! will await for the next two. 
Thanks @MatthewVernon [19:52:09] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1192195/7114/zuul1001.eqiad.wmnet/change.zuul1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [19:54:23] (03PS3) 10Dzahn: zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) [19:57:13] (03CR) 10Scott French: [C:03+2] deployment_server: enable support for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192203 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:57:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:57:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:57:44] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1192195/7117/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [19:57:48] (03CR) 10Dzahn: [C:03+2] zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2000) [20:00:04] lucaswerkmeister and sergi0: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:14] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP [20:00:18] o/ [20:00:21] o/ [20:00:28] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: WIP [20:04:25] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [20:04:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11226748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm executed with errors: -... [20:04:53] lucaswerkmeister: are you self-deploying? [20:05:29] preferably not, as it’s Lucas_WMDE who has deployment rights, not my volunteer self ^^ [20:05:36] but I guess if nobody else is around… [20:05:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:05:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:05:44] Alright, I can deploy then [20:06:28] cool, thanks! 
[20:08:16] (03PS1) 10CDanis: puppetserver::volatile: Default to no XCheeseScore [puppet] - 10https://gerrit.wikimedia.org/r/1192224 [20:08:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191804 (https://phabricator.wikimedia.org/T405830) (owner: 10Lucas Werkmeister) [20:09:00] (03PS3) 10Andrew Bogott: P:openstack: nova: Drop obsolete settings [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah) [20:09:03] * lucaswerkmeister tries to put a test case together [20:09:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah) [20:09:14] (03Merged) 10jenkins-bot: Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191804 (https://phabricator.wikimedia.org/T405830) (owner: 10Lucas Werkmeister) [20:09:32] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul2001.codfw.wmnet with reason: WIP [20:09:37] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1191804|Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata (T405830)]] [20:09:43] T405830: Enable USERLANGUAGE magic word for Wikidata - https://phabricator.wikimedia.org/T405830 [20:09:44] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: WIP [20:09:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:03] ok I should be able to test it at https://www.wikidata.org/wiki/User:Lucas_Werkmeister/sandbox?uselang=de once it’s on mwdebug [20:11:28] (03PS29) 10Bking: opensearch-cluster: Add chart for 
review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:11:41] (03CR) 10Andrew Bogott: [C:03+1] P:openstack: nova: Drop obsolete settings [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah) [20:14:27] (03CR) 10Bking: "Yes, the cluster exposes metrics at `_prometheus/metrics` on the primary port (9200). I added some annotations in the last patchset to exp" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:16:29] !log sgimeno@deploy2002 lucaswerkmeister, sgimeno: Backport for [[gerrit:1191804|Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata (T405830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:16:35] T405830: Enable USERLANGUAGE magic word for Wikidata - https://phabricator.wikimedia.org/T405830 [20:16:48] lucaswerkmeister: please test [20:17:09] it works \o/ [20:17:16] after a purge, https://www.wikidata.org/wiki/User:Lucas_Werkmeister/sandbox?uselang=de says German instead of English [20:17:37] great, syncing [20:17:53] !log sgimeno@deploy2002 lucaswerkmeister, sgimeno: Continuing with sync [20:18:00] thanks! 
[20:18:09] yw [20:22:37] (03PS30) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:23:00] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191804|Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata (T405830)]] (duration: 13m 23s) [20:23:07] T405830: Enable USERLANGUAGE magic word for Wikidata - https://phabricator.wikimedia.org/T405830 [20:23:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno) [20:24:46] (03Merged) 10jenkins-bot: Growth: enable new notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno) [20:25:05] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1190703|Growth: enable new notifications (T404085)]] [20:25:11] T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085 [20:25:40] (03CR) 10Bking: opensearch-cluster: Add chart for review (3/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:26:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:26:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:29:51] 10ops-codfw, 06DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973 
(10phaultfinder) 03NEW [20:32:25] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1190703|Growth: enable new notifications (T404085)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:32:31] T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085 [20:33:47] !log sgimeno@deploy2002 sgimeno: Continuing with sync [20:33:57] (03PS1) 10BCornwall: Remove wikimedia_trust ACLs from varnish/haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) [20:36:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:36:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:38:50] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190703|Growth: enable new notifications (T404085)]] (duration: 13m 45s) [20:38:55] (03CR) 10Btullis: [C:03+2] Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [20:38:57] T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085 [20:39:35] (03CR) 10Btullis: [C:03+2] Import the upstream spark-operator chart version 2.2.1 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [20:40:21] (03Merged) 10jenkins-bot: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [20:40:26] !log end of UTC late backport 
window [20:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:00] (03Merged) 10jenkins-bot: Import the upstream spark-operator chart version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [20:42:36] 10ops-codfw, 06SRE, 06DC-Ops: codfw netbox cable cleanup - https://phabricator.wikimedia.org/T402535#11226865 (10Jhancock.wm) a:03Jhancock.wm [20:42:47] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7123/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) (owner: 10BCornwall) [20:43:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11226867 (10Papaul) [20:44:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11226869 (10Papaul) p:05Triage→03Medium [20:55:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:56:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:57:43] (03PS31) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:59:27] (03CR) 10Cappybaraa: "Portal is already added to core-Namespaces.php, I checked and it does not need changes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2100). [21:01:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [21:01:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1236.eqiad.wmnet with OS bullseye [21:05:39] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:05:39] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:05:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11226930 (10BTullis) p:05Triage→03High [21:09:57] (03PS1) 10BCornwall: wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) [21:10:32] if the security window isn't in use today, I might deploy some envoy upgrades [21:13:36] (03PS2) 10BCornwall: wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) [21:14:48] (03PS2) 10Ryan Kemper: wdqs: shift old full graph hosts to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) [21:16:13] (03CR) 10Bking: 
[C:03+1] "nit: a couple of the hosts are changing roles to internal-scholarly and scholarly (not just internal-main)" [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:16:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11226958 (10Papaul) [21:16:55] (03PS3) 10Ryan Kemper: wdqs: shift old full graph hosts to new roles [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) [21:17:35] (03CR) 10Ryan Kemper: [C:03+2] wdqs: shift old full graph hosts to new roles [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:24:41] (03PS1) 10Btullis: Add 28 new hadoop workers to the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1192239 (https://phabricator.wikimedia.org/T398438) [21:27:39] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:27:39] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:28:02] !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs2016.codfw.wmnet with OS bullseye [21:28:13] (03PS1) 10Scott French: deployment_server: switch next and migration releases to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192227 (https://phabricator.wikimedia.org/T405955) [21:28:14] (03PS1) 10Scott French: trafficserver: enable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1192228 (https://phabricator.wikimedia.org/T405955) [21:31:22] (03CR) 10Bking: 
opensearch-cluster: Add chart for review (3/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:33:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 8 CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1192239 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [21:33:57] !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye [21:35:38] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:35:38] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:36:30] (03CR) 10RLazarus: [C:03+2] mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [21:36:53] !log btullis@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1209-1236].eqiad.wmnet [21:38:33] (03Merged) 10jenkins-bot: mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [21:38:33] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11227045 (10VRiley-WMF) 05Open→03Resolved Replacement unit received and deployed. Contacted vendor multiple times regarding return of the damaged PDU, but no instructions/shipping label have been provided. A... 
[21:39:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1209-1236].eqiad.wmnet [21:40:54] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1191522 T403663 [21:41:01] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [21:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:42:38] 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11227064 (10BTullis) a:03BTullis [21:43:55] 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11227070 (10BTullis) p:05Triage→03Low [21:45:41] !log bking@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [21:46:37] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1191522 T403663 (duration: 06m 44s) [21:46:44] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [21:47:55] 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11227080 (10BTullis) There are now 5 hosts showing this error: {F66711315} * an-worker1187 * an-... 
[21:48:51] !log bking@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [21:52:21] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980 (10RobH) 03NEW [21:52:37] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227121 (10RobH) [21:54:34] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227127 (10RobH) @jgreen, Please note this host will still leverage BIOS capable booting and can be setup as such (you did not specify in the ordering task) but future generations sta... [21:54:58] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227128 (10RobH) [21:55:14] (03CR) 10Btullis: "I wonder if we should revisit the reasons for choosing the opensearch-operator version 2.7.0." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:56:03] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227129 (10RobH) [21:57:24] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981 (10RobH) 03NEW [21:57:59] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11227159 (10RobH) [22:01:39] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982 (10RobH) 03NEW [22:01:45] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983 (10RobH) 03NEW [22:01:54] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982#11227198 (10RobH) [22:02:07] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983#11227202 (10RobH) [22:05:43] (03CR) 10RLazarus: [C:03+2] mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [22:07:21] (03Merged) 10jenkins-bot: mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [22:14:12] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [22:14:16] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [22:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - 
https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:17:41] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [22:17:48] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [22:18:23] (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Wikisource [puppet] - 10https://gerrit.wikimedia.org/r/1192246 (https://phabricator.wikimedia.org/T403510) [22:18:25] (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510) [22:21:59] (03CR) 10Bking: "I should have done a better job of documenting this, but the 2.8.0 chart is not compatible with OpenSearch 2.7.0 (see https://github.com/o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [22:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:25:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [22:27:32] (03CR) 10Dzahn: [C:03+1] "as far as I can see it looks good to me" [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [22:29:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [22:34:38] 06SRE, 13Patch-For-Review: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - 
https://phabricator.wikimedia.org/T398161#11227347 (BCornwall) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180712 Seems to have broken varnish tests. Looking through seems to suggest this is because `profile::cache... [22:35:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [22:35:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [22:38:43] (PS1) Btullis: Customise the login.html template of JupyterHub to hide the TLS warning [puppet] - https://gerrit.wikimedia.org/r/1192259 (https://phabricator.wikimedia.org/T403863) [22:40:17] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7133/co" [puppet] - https://gerrit.wikimedia.org/r/1192259 (https://phabricator.wikimedia.org/T403863) (owner: Btullis) [22:54:18] !log bking@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2017.codfw.wmnet with OS bullseye [22:54:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [22:55:37] !log bking@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [22:55:49] !log bking@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [22:56:25] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [22:56:35] !log bking@cumin2002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [22:56:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [22:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2300) [23:05:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:05:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:06:20] (PS2) Krinkle: varnish: Enable unified mobile routing on Wikisource [puppet] - https://gerrit.wikimedia.org/r/1192246 (https://phabricator.wikimedia.org/T403510) [23:06:20] (PS2) Krinkle: varnish: Enable unified mobile routing on Wiktionary [puppet] - https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510) [23:06:20] (PS1) Krinkle: beta: Remove redundant enable_m_redir_except_regex setting [puppet] - https://gerrit.wikimedia.org/r/1192263 (https://phabricator.wikimedia.org/T403510) [23:06:22] (PS1) Krinkle: varnish: Enable unified mobile routing on Wikidata [puppet] - https://gerrit.wikimedia.org/r/1192264 (https://phabricator.wikimedia.org/T403510) [23:06:24] (PS1) Krinkle: varnish: Enable unified mobile routing on Commons [puppet] - 
https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510) [23:06:27] (PS1) Krinkle: varnish: Enable unified mobile routing on fr.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) [23:06:29] (PS1) Krinkle: varnish: Enable unified mobile routing on de.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192267 (https://phabricator.wikimedia.org/T403510) [23:06:31] (PS1) Krinkle: varnish: Enable unified mobile routing on es.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) [23:06:35] (PS1) Krinkle: varnish: Enable unified mobile routing on ru.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192269 (https://phabricator.wikimedia.org/T403510) [23:06:38] (PS1) Krinkle: varnish: Enable unified mobile routing on ja.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192270 (https://phabricator.wikimedia.org/T403510) [23:06:42] (PS1) Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [23:06:46] (PS1) Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510) [23:15:04] (PS1) Krinkle: beta: Remove redundant enable_m_redir_except_regex setting [puppet] - https://gerrit.wikimedia.org/r/1192263 (https://phabricator.wikimedia.org/T403510) [23:23:23] (PS1) Krinkle: Disable wmgUseMdotRouting on Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1192276 (https://phabricator.wikimedia.org/T403510) [23:23:26] (PS1) Krinkle: Disable wmgUseMdotRouting on Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1192277 (https://phabricator.wikimedia.org/T403510) [23:23:28] (PS1) Krinkle: Disable wmgUseMdotRouting on Wikidata 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1192278 (https://phabricator.wikimedia.org/T403510) [23:23:30] (PS1) Krinkle: Disable wmgUseMdotRouting on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510) [23:25:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:27:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:35:45] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:35:45] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:37:52] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1192281 [23:37:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1192281 (owner: TrainBranchBot) [23:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:46:22] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [23:46:31] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install 
dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11227509 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet wi... [23:49:38] (PS1) RLazarus: deployment_server: Prefix `helmfile apply` output with "[service env]" [puppet] - https://gerrit.wikimedia.org/r/1192282 [23:49:45] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2016 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:49:45] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:54:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1192281 (owner: TrainBranchBot) [23:54:51] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:56:45] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:58:08] (PS1) BCornwall: Remove wikimedia.support from ncredir/acme-chief [puppet] - https://gerrit.wikimedia.org/r/1192283 (https://phabricator.wikimedia.org/T400952) [23:58:45] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:59:10] RESOLVED: JobUnavailable: Reduced 
availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable