[00:04:51] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1191865
[00:08:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1191865 (owner: TrainBranchBot)
[00:29:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1191865 (owner: TrainBranchBot)
[01:00:40] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:14:30] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 50s)
[01:36:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:14:51] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[02:24:51] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:32] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:44:51] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:09:10] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:24] <wikibugs>	 (CR) Finchgold: [C:+1] Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1191804 (https://phabricator.wikimedia.org/T405830) (owner: Lucas Werkmeister)
[05:06:13] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191364 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[05:08:10] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-09-25-074241-production [deployment-charts] - https://gerrit.wikimedia.org/r/1191364 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[05:09:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:35] <wikibugs>	 (PS1) KartikMistry: Revert "Update cxserver to 2025-09-25-074241-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1191872
[05:20:36] <wikibugs>	 (CR) KartikMistry: [C:+2] Revert "Update cxserver to 2025-09-25-074241-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1191872 (owner: KartikMistry)
[05:22:34] <wikibugs>	 (Merged) jenkins-bot: Revert "Update cxserver to 2025-09-25-074241-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1191872 (owner: KartikMistry)
[05:34:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:10] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:56:58] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[05:58:50] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:14:51] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[06:24:51] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:28:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet
[06:33:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet
[06:37:48] <moritzm>	 !log upgrade Envoy on chartmuseum hosts T403663
[06:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] <stashbot>	 T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[06:38:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet
[06:38:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet
[06:38:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[06:41:11] <moritzm>	 dse-k8s-etcd1003 and ml-etcd1002 will go down for a Ganeti reboot
[06:41:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet
[06:43:14] <icinga-wm>	 PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:43:32] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:32] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[06:45:42] <icinga-wm>	 RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[06:46:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet
[06:46:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet
[06:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T0700). Please do the needful.
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:07:48] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] ceph: add module to sync a bucket locally [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:13:19] <wikibugs>	 (CR) Fabfur: [C:+1] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: CDanis)
[07:17:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet
[07:19:40] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[07:20:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet
[07:21:12] <wikibugs>	 (PS3) Majavah: haproxy::cloud: Add an admin-level socket [puppet] - https://gerrit.wikimedia.org/r/1191662
[07:21:12] <wikibugs>	 (PS6) Majavah: haproxy::cloud: Do not duplicate main haproxy class [puppet] - https://gerrit.wikimedia.org/r/1191664
[07:21:46] <wikibugs>	 (CR) Elukey: [C:+1] imposm-initial-import: Set service passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:23:33] <wikibugs>	 (CR) Majavah: [C:+2] haproxy::cloud: Add an admin-level socket [puppet] - https://gerrit.wikimedia.org/r/1191662 (owner: Majavah)
[07:25:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet
[07:25:42] <wikibugs>	 (CR) Majavah: [C:+2] haproxy::cloud: Do not duplicate main haproxy class [puppet] - https://gerrit.wikimedia.org/r/1191664 (owner: Majavah)
[07:25:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet
[07:25:58] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[07:26:50] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Docker
[07:27:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet
[07:32:19] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3019950) is awaiting input
[07:32:29] <wikibugs>	 SRE, collaboration-services, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11223257 (MoritzMuehlenhoff)
[07:32:32] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:09] <wikibugs>	 (CR) Elukey: osm_master: Store kartotherian and tegola passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:34:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet
[07:35:21] <wikibugs>	 (CR) Muehlenhoff: osm_master: Store kartotherian and tegola passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:37:11] <moritzm>	 !log upgrade Envoy on config-master* T403663
[07:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:18] <stashbot>	 T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[07:39:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet
[07:39:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet
[07:41:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet
[07:44:39] <wikibugs>	 (PS1) Elukey: CHANGELOG: add changelogs for release v11.9.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1191983
[07:44:44] <wikibugs>	 (PS1) Jelto: ceph::client::sync_local: fix ensure for directory [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922)
[07:44:51] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:45:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet
[07:46:22] <wikibugs>	 Puppet, MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11223314 (Krinkle)
[07:46:57] <wikibugs>	 Puppet, MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425#11223316 (Krinkle)
[07:47:01] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:47:14] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11223317 (Krinkle)
[07:51:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet
[07:51:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet
[07:52:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet
[07:52:13] <wikibugs>	 (CR) Elukey: [C:+1] osm_master: Store kartotherian and tegola passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:52:33] <wikibugs>	 (CR) Hashar: [C:-1] phabricator: hiera'ize the apc_shm_size variable (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[07:55:06] <moritzm>	 ml-etcd1003 will go down for a Ganeti reboot
[07:55:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet
[07:55:27] <wikibugs>	 (CR) Muehlenhoff: [C:+2] osm_master: Store kartotherian and tegola passwords [puppet] - https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:55:34] <wikibugs>	 (CR) Elukey: [C:+2] CHANGELOG: add changelogs for release v11.9.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1191983 (owner: Elukey)
[07:56:34] <icinga-wm>	 PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:58:26] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] ceph::client::sync_local: fix ensure for directory [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:00:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet
[08:00:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet
[08:00:42] <icinga-wm>	 RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:01:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1053.eqiad.wmnet
[08:02:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet
[08:02:25] <wikibugs>	 (PS1) Elukey: Upstream release v11.9.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1192050
[08:05:55] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] Upstream release v11.9.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1192050 (owner: Elukey)
[08:07:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet
[08:07:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1053.eqiad.wmnet
[08:08:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1054.eqiad.wmnet
[08:09:51] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:34] <wikibugs>	 (CR) David Caro: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1191984 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:11:55] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3041113) is awaiting input
[08:13:33] <wikibugs>	 (PS1) Muehlenhoff: Don't set profile::maps::osm_master::tilerator_pass in role default [puppet] - https://gerrit.wikimedia.org/r/1192054 (https://phabricator.wikimedia.org/T381565)
[08:14:41] <wikibugs>	 (PS1) Slyngshede: Update CAS to version 7.1.6.2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1192055
[08:15:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet
[08:16:28] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11223446 (Jelto) Sync from object storage to a local folder works with the new `ceph::client::sync_local` module. I tested this o...
[08:17:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1192055 (owner: Slyngshede)
[08:17:35] <wikibugs>	 (PS1) Elukey: sre.hosts.provision: update to Redfish's hw_model [cookbooks] - https://gerrit.wikimedia.org/r/1192056
[08:18:41] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Update CAS to version 7.1.6.2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1192055 (owner: Slyngshede)
[08:20:30] <wikibugs>	 SRE-tools, Infrastructure-Foundations: CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706#11223453 (elukey) https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1192056
[08:20:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet
[08:20:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1054.eqiad.wmnet
[08:20:52] <wikibugs>	 (CR) Arnaudb: [C:+1] "thanks for the quickfix! lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1192056 (owner: Elukey)
[08:20:54] <elukey>	 !log uploaded spicerack_11.9.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia
[08:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[08:26:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[08:32:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[08:32:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[08:33:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:35:16] <elukey>	 !log rolled out spicerack 11.9.0 to all cumin nodes
[08:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:22] <wikibugs>	 (PS1) Slyngshede: IDP: Upgrade to CAS 7.1.6.2 [dns] - https://gerrit.wikimedia.org/r/1192058
[08:35:29] <wikibugs>	 (CR) Elukey: [C:+2] sre.hosts.provision: update to Redfish's hw_model [cookbooks] - https://gerrit.wikimedia.org/r/1192056 (owner: Elukey)
[08:35:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
[08:36:16] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops, Spicerack: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11223585 (elukey) Spicerack 11.9.0 deployed on all cumin nodes :)
[08:37:34] <wikibugs>	 (CR) Elukey: [C:+1] Don't set profile::maps::osm_master::tilerator_pass in role default [puppet] - https://gerrit.wikimedia.org/r/1192054 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:38:21] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP: Upgrade to CAS 7.1.6.2 [dns] - https://gerrit.wikimedia.org/r/1192058 (owner: Slyngshede)
[08:38:27] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:38:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:39:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Don't set profile::maps::osm_master::tilerator_pass in role default [puppet] - https://gerrit.wikimedia.org/r/1192054 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:39:50] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[08:43:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
[08:44:53] <wikibugs>	 SRE-tools, Infrastructure-Foundations: CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706#11223632 (ABran-WMF) Open→Resolved a:elukey [[ https://integration.wikimedia.org/ci/job/tox/7677/console | CI went through ]], thanks for the fix!
[08:45:23] <wikibugs>	 (PS7) D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808)
[08:45:49] <wikibugs>	 (PS10) Arnaudb: gerrit: bugfixes on failover [cookbooks] - https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833)
[08:45:49] <wikibugs>	 (CR) Arnaudb: "full dry-run output is visible here: https://phabricator.wikimedia.org/P83469" [cookbooks] - https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:46:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01)
[08:49:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
[08:49:31] <wikibugs>	 (PS1) Slyngshede: Revert "IDP: Upgrade to CAS 7.1.6.2" [dns] - https://gerrit.wikimedia.org/r/1192059
[08:49:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[08:50:41] <wikibugs>	 (PS1) Jelto: gitlab: enable bucket sync on production host [puppet] - https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922)
[08:51:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[08:52:02] <jynus>	 !log powercycling db1150 T405885
[08:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:08] <stashbot>	 T405885: db1150 crash - https://phabricator.wikimedia.org/T405885
[08:52:19] <wikibugs>	 (CR) Slyngshede: [C:+2] Revert "IDP: Upgrade to CAS 7.1.6.2" [dns] - https://gerrit.wikimedia.org/r/1192059 (owner: Slyngshede)
[08:52:32] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:53:17] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:53:57] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[08:54:39] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3061885) is awaiting input
[08:54:52] <icinga-wm>	 RECOVERY - Host db1150 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[08:54:56] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 on db1150 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:04] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s4 on db1150 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:04] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 on db1150 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:26] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s3 on db1150 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:56:14] <icinga-wm>	 PROBLEM - MariaDB read only s3 on db1150 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:56:29] <wikibugs>	 (PS2) Muehlenhoff: imposm-initial-import: Set service passwords [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565)
[08:56:45] <wikibugs>	 (CR) Muehlenhoff: imposm-initial-import: Set service passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:57:04] <icinga-wm>	 PROBLEM - mysqld processes on db1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:57:04] <moritzm>	 kubestagemaster2005 will go down for a Ganeti reboot
[08:57:14] <icinga-wm>	 PROBLEM - MariaDB read only s4 on db1150 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:58:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
[09:00:14] <icinga-wm>	 PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:00:47] <wikibugs>	 (PS1) Slyngshede: IDP: CAS 7.1.6.2 upgrade [dns] - https://gerrit.wikimedia.org/r/1192062
[09:03:03] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11223739 (elukey) To recap, it seems that we have two problems:  1) For some mysterious reasons, sretest2010 seems to have stopped...
[09:03:35] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11223742 (elukey) >>! In T404356#11217341, @jhathaway wrote: >>>! In T404356#11184299, @elukey wrote: >> The host doesn't PXE/HTTP boot for some reason, I reopened the provision...
[09:04:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
[09:04:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[09:04:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:05:24] <icinga-wm>	 RECOVERY - Host kubestagemaster2005 is UP: PING WARNING - Packet loss = 80%, RTA = 30.49 ms
[09:05:53] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:06:42] <wikibugs>	 (CR) Sergio Gimeno: "This is now ready for review" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: Sergio Gimeno)
[09:07:17] <wikibugs>	 SRE, Product Safety and Integrity, MW-1.45-notes (1.45.0-wmf.21; 2025-09-30), Patch-For-Review, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11223756 (OKryva-WMF)
[09:08:35] <wikibugs>	 (PS3) Sergio Gimeno: Growth: enable new notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085)
[09:09:26] <wikibugs>	 (CR) Fabfur: [C:+1] IDP: CAS 7.1.6.2 upgrade [dns] - https://gerrit.wikimedia.org/r/1192062 (owner: Slyngshede)
[09:09:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:10:27] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP: CAS 7.1.6.2 upgrade [dns] - https://gerrit.wikimedia.org/r/1192062 (owner: Slyngshede)
[09:10:32] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[09:11:04] <slyngs>	 !log Upgrading IDP/CAS-SSO to version 7.1.6.2
[09:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:08] <wikibugs>	 SRE, MW-1.45-notes (1.45.0-wmf.21; 2025-09-30), Patch-For-Review, Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to ... - https://phabricator.wikimedia.org/T404204#11223798
[09:11:57] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[09:13:12] <wikibugs>	 (CR) Fabfur: [C:+2] varnish: remove Host header normalization [puppet] - https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: Fabfur)
[09:23:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[09:24:00] <wikibugs>	 (PS1) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150 [puppet] - https://gerrit.wikimedia.org/r/1192069 (https://phabricator.wikimedia.org/T405885)
[09:24:03] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[09:24:36] <gehel>	 !log restarting blazegraph on wdqs2007, wdqs2021 and wdqs2011 (high thread count)
[09:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:07] <wikibugs>	 (PS1) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:25:39] <wikibugs>	 (CR) CI reject: [V:-1] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:26:46] <wikibugs>	 (PS2) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:27:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:28:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:31:20] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[09:31:42] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[09:32:17] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:32:20] <wikibugs>	 (PS3) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:32:58] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:34:36] <wikibugs>	 (CR) David Caro: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:34:56] <wikibugs>	 (CR) Filippo Giunchedi: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:37:09] <wikibugs>	 (PS4) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:37:24] <wikibugs>	 (PS5) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:37:26] <wikibugs>	 (CR) David Caro: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:37:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[09:37:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] imposm-initial-import: Set service passwords [puppet] - https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:37:58] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:38:45] <wikibugs>	 (CR) David Caro: tools: add more reliable stats on nfs stuck workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:41:23] <gehel>	 !log depooling wdqs2007, wdqs2021 and wdqs2011 (update lag)
[09:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:54] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: MariaDB package update
[09:42:33] <moritzm>	 ml-staging-etcd2001 will go down for a ganeti reboot
[09:42:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[09:43:03] <wikibugs>	 (PS6) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:43:32] <wikibugs>	 (CR) CI reject: [V:-1] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:43:48] <wikibugs>	 (PS7) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:44:18] <wikibugs>	 (CR) CI reject: [V:-1] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:44:20] <icinga-wm>	 PROBLEM - Host ml-staging-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:45:11] <wikibugs>	 (PS2) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150 [puppet] - https://gerrit.wikimedia.org/r/1192069 (https://phabricator.wikimedia.org/T405885)
[09:45:12] <wikibugs>	 (PS8) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:48:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[09:48:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[09:49:10] <wikibugs>	 (CR) Arnaudb: "some answers, and a question inline." [puppet] - https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb)
[09:49:55] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[09:50:20] <wikibugs>	 (PS2) Muehlenhoff: Add maps1012 to maps1014 as replicas [puppet] - https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565)
[09:50:32] <icinga-wm>	 RECOVERY - Host ml-staging-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[09:50:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[09:52:25] <wikibugs>	 (PS9) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:52:59] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Upgrade db1245 to MariaDB 10.11 so it can take over db1150 [puppet] - https://gerrit.wikimedia.org/r/1192069 (https://phabricator.wikimedia.org/T405885) (owner: Jcrespo)
[09:53:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[09:55:35] <wikibugs>	 (CR) Elukey: [C:+1] Add maps1012 to maps1014 as replicas [puppet] - https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:58:39] <wikibugs>	 (PS10) David Caro: tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070
[09:59:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[10:00:04] <wikibugs>	 (PS1) Elukey: aptrepo: add kubernetes 1.23 support to Trixie Wikimedia [puppet] - https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1000)
[10:00:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[10:00:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[10:01:53] <wikibugs>	 (CR) Arnaudb: [C:+1] "pcc looks good, lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[10:03:19] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1245.eqiad.wmnet
[10:03:19] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1245.eqiad.wmnet
[10:04:51] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3097273) is awaiting input
[10:08:39] <icinga-wm>	 ACKNOWLEDGEMENT - snapshot of s3 in eqiad on backupmon1001 is CRITICAL: snapshot for s3 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2025-09-25 05:35:19 Jcrespo T405885 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:08:39] <icinga-wm>	 ACKNOWLEDGEMENT - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4 at eqiad (db1150) taken more than 3 days ago: Most recent backup 2025-09-25 02:18:53 Jcrespo T405885 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:13:55] <wikibugs>	 (PS3) Clément Goubert: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368)
[10:14:51] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[10:16:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[10:19:13] <wikibugs>	 (CR) David Caro: [C:+2] tools: add more reliable stats on nfs stuck workers [puppet] - https://gerrit.wikimedia.org/r/1192070 (owner: David Caro)
[10:19:31] <wikibugs>	 (PS6) Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: Pmiazga)
[10:19:43] <wikibugs>	 (CR) CI reject: [V:-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: Pmiazga)
[10:22:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[10:22:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[10:23:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[10:24:51] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:25:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
[10:25:51] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11224014 (jcrespo) Sorry I didn't provide details last week, but it was quite late in my timezone. You already saw the issue, which I was late to detect because everything else w...
[10:27:01] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbprov1007.eqiad.wmnet with reason: needs reimage
[10:27:20] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11224033 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c6c8bf64-47b3-4e1e-b33d-0785ef15336a) set by jynus@cumin1003 for 7 days, 0:00:00 on 1 host(s) and their...
[10:27:58] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:29:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[10:31:25] <wikibugs>	 (CR) Effie Mouzeli: "Looks OK (as on a regex can look). It would be great if in the future we add a couple of comments in the yaml file to explain what those re" [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: Clément Goubert)
[10:31:28] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: Clément Goubert)
[10:31:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
[10:31:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
[10:31:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2048.codfw.wmnet
[10:32:58] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:33:32] <wikibugs>	 (CR) JMeybohm: [C:+2] haproxy ipblocks-all: Filter disabled ipblocks [puppet] - https://gerrit.wikimedia.org/r/1190274 (https://phabricator.wikimedia.org/T402014) (owner: JMeybohm)
[10:33:45] <wikibugs>	 (CR) Clément Goubert: "Thanks for the review." [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: Clément Goubert)
[10:34:07] <Dreamy_Jazz>	 !log Created `global_block_whitelist` on thwikimedia - T400001
[10:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:13] <stashbot>	 T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001
[10:36:05] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] "Cheers, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: Clément Goubert)
[10:36:21] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3112619) is awaiting input
[10:37:16] <wikibugs>	 (PS1) Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406)
[10:37:20] <moritzm>	 dse-k8s-etcd2002 is going down for a Ganeti reboot
[10:37:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet
[10:37:37] <wikibugs>	 (CR) JMeybohm: [C:+1] aptrepo: add kubernetes 1.23 support to Trixie Wikimedia [puppet] - https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[10:39:18] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:40:50] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.68 ms
[10:42:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet
[10:42:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2048.codfw.wmnet
[10:43:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2049.codfw.wmnet
[10:44:31] <wikibugs>	 (PS10) Brouberol: airflow: automatically figure out some values to reduce release config size [deployment-charts] - https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485)
[10:45:52] <wikibugs>	 (CR) Brouberol: [C:+1] Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: Btullis)
[10:46:25] <moritzm>	 aux-k8s-etcd2004 is going down for a Ganeti reboot
[10:46:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet
[10:48:22] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[10:49:57] <wikibugs>	 (CR) Btullis: airflow: automatically figure out some values to reduce release config size (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: Brouberol)
[10:50:30] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.78 ms
[10:51:06] <wikibugs>	 (CR) Brouberol: airflow: automatically figure out some values to reduce release config size (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: Brouberol)
[10:51:24] <wikibugs>	 (CR) Elukey: [C:+2] aptrepo: add kubernetes 1.23 support to Trixie Wikimedia [puppet] - https://gerrit.wikimedia.org/r/1192087 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[10:52:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet
[10:52:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2049.codfw.wmnet
[10:52:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2050.codfw.wmnet
[10:52:36] <wikibugs>	 (PS2) Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406)
[10:52:43] <wikibugs>	 (CR) Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: Btullis)
[10:53:07] <moritzm>	 !log upgrade Envoy on an-web1001 T403663
[10:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:12] <stashbot>	 T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[10:54:46] <wikibugs>	 (CR) CI reject: [V:-1] Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: Btullis)
[10:55:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet
[10:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:09] <gehel>	 !log pooling wdqs2021 and wdqs2011 (caught up on lag)
[10:59:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:11] <wikibugs>	 (CR) Jon Harald Søby: [C:-1] "Like I mentioned in the other patch, these namespace aliases need to be added to that file. But you can re-use this patch to change the co" [mediawiki-config] - https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: Cappybaraa)
[11:00:38] <wikibugs>	 (PS3) Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406)
[11:00:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet
[11:00:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2050.codfw.wmnet
[11:01:53] <wikibugs>	 (PS1) Kosta Harlan: UIC: Disable external permission check for Active wikis section [extensions/CheckUser] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192097 (https://phabricator.wikimedia.org/T405889)
[11:03:12] <kostajh>	 jouncebot: nowandnext
[11:03:12] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 56 minute(s)
[11:03:12] <jouncebot>	 In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1300)
[11:03:31] <wikibugs>	 (PS4) Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406)
[11:07:28] <wikibugs>	 (PS5) Btullis: Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406)
[11:07:45] <wikibugs>	 (PS1) Kosta Harlan: SI: Fix sorting by status [extensions/CheckUser] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192098 (https://phabricator.wikimedia.org/T405605)
[11:09:55] <wikibugs>	 (CR) Btullis: [C:+2] Add site.pp and preseed.yaml information for dse-k8s-worker200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1192091 (https://phabricator.wikimedia.org/T405406) (owner: Btullis)
[11:10:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192098 (https://phabricator.wikimedia.org/T405605) (owner: Kosta Harlan)
[11:10:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192097 (https://phabricator.wikimedia.org/T405889) (owner: Kosta Harlan)
[11:10:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:59] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1235.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:14:19] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1235.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:14:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye
[11:15:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:22:24] <wikibugs>	 (Merged) jenkins-bot: SI: Fix sorting by status [extensions/CheckUser] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192098 (https://phabricator.wikimedia.org/T405605) (owner: Kosta Harlan)
[11:22:26] <wikibugs>	 (Merged) jenkins-bot: UIC: Disable external permission check for Active wikis section [extensions/CheckUser] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192097 (https://phabricator.wikimedia.org/T405889) (owner: Kosta Harlan)
[11:22:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:23:10] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192098|SI: Fix sorting by status (T405605)]], [[gerrit:1192097|UIC: Disable external permission check for Active wikis section (T405889)]]
[11:23:18] <stashbot>	 T405605: Suggested investigations: Sorting by status doesn't always work - https://phabricator.wikimedia.org/T405605
[11:23:19] <stashbot>	 T405889: Disable external permissions check in UserInfoCard - https://phabricator.wikimedia.org/T405889
[11:23:32] <gehel>	 !log pooling wdqs2007 (caught up on lag)
[11:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:30:34] <logmsgbot>	 btullis@cumin1003 reimage (PID 484801) is awaiting input
[11:30:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11224233 (BTullis) a:BTullis→None Should be good to go. Thanks.
[11:31:11] <wikibugs>	 (CR) JMeybohm: [C:+1] Update eqiad pod ip range [puppet] - https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) (owner: Jelto)
[11:32:02] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: Jelto)
[11:36:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11224277 (BTullis) That worked well, it seems. ` btullis@cumin1003:~$ sudo cumin 'an-worker[1209-1232].eqiad.wmnet' 'perccli64 /c0 add vd each r0 wb ra' 24 hosts will be targeted:...
[11:38:10] <wikibugs>	 (CR) JMeybohm: [C:-1] Update eqiad to kubernetes 1.31, calico 3.29 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: Jelto)
[11:40:09] <wikibugs>	 (CR) Clément Goubert: [C:+2] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: Clément Goubert)
[11:41:56] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: Clément Goubert)
[11:42:03] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: enable bucket sync on production host [puppet] - https://gerrit.wikimedia.org/r/1192060 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[11:42:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11224309 (BTullis) Open→Resolved I think we can resolve this now. I have created T405903 to track adding these hosts to the cluster.
[11:42:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:42:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:44:51] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:47:29] <wikibugs>	 (PS2) Jelto: gitlab: enable object storage for packages [puppet] - https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922)
[11:49:52] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192098|SI: Fix sorting by status (T405605)]], [[gerrit:1192097|UIC: Disable external permission check for Active wikis section (T405889)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:50:01] <stashbot>	 T405605: Suggested investigations: Sorting by status doesn't always work - https://phabricator.wikimedia.org/T405605
[11:50:02] <stashbot>	 T405889: Disable external permissions check in UserInfoCard - https://phabricator.wikimedia.org/T405889
[11:52:50] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7083/co" [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[11:54:16] <Dreamy_Jazz>	 kostajh: Suggested investigation fix works as expected
[11:54:24] <kostajh>	 Dreamy_Jazz: thanks
[11:54:49] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[12:01:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl_rules_file: fix path for non-cache hit scopes [puppet] - 10https://gerrit.wikimedia.org/r/1192105
[12:03:06] <wikibugs>	 (03CR) 10JMeybohm: Update eqiad to k8s 1.31 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto)
[12:03:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.963s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:04:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl_rules_file: fix path for non-cache hit scopes [puppet] - 10https://gerrit.wikimedia.org/r/1192105 (owner: 10Giuseppe Lavagetto)
[12:06:41] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage for packages [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[12:07:19] <wikibugs>	 (03PS5) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801)
[12:07:21] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192098|SI: Fix sorting by status (T405605)]], [[gerrit:1192097|UIC: Disable external permission check for Active wikis section (T405889)]] (duration: 44m 11s)
[12:07:30] <stashbot>	 T405605: Suggested investigations: Sorting by status doesn't always work - https://phabricator.wikimedia.org/T405605
[12:07:31] <stashbot>	 T405889: Disable external permissions check in UserInfoCard - https://phabricator.wikimedia.org/T405889
[12:09:51] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:50] <wikibugs>	 (03PS1) 10Btullis: Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943)
[12:11:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[12:12:43] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert)
[12:13:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:15:52] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11224409 (10Jelto)
[12:18:17] <wikibugs>	 (03PS11) 10Arnaudb: gerrit: bugfixes on failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833)
[12:18:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "wording fixed" [cookbooks] - 10https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:23] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: bugfixes on failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1191431 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:28:01] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:35:22] <wikibugs>	 (03PS2) 10Jelto: Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703)
[12:37:33] <wikibugs>	 (03CR) 10Jelto: Update eqiad to kubernetes 1.31, calico 3.29 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto)
[12:40:00] <wikibugs>	 (03PS1) 10Brouberol: Define the kafka-mirrromaker kubeconfigs in dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373)
[12:40:02] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye
[12:40:24] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye
[12:41:00] <wikibugs>	 (03PS2) 10Jelto: Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703)
[12:41:35] <wikibugs>	 (03PS1) 10Brouberol: kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373)
[12:41:57] <wikibugs>	 (03CR) 10Jelto: Update eqiad to k8s 1.31 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto)
[12:44:20] <wikibugs>	 (03PS1) 10Btullis: Add new dummy keytabs for an-launcher1003 [labs/private] - 10https://gerrit.wikimedia.org/r/1192120 (https://phabricator.wikimedia.org/T402943)
[12:45:05] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] Add new dummy keytabs for an-launcher1003 [labs/private] - 10https://gerrit.wikimedia.org/r/1192120 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[12:45:39] <wikibugs>	 (03PS2) 10Brouberol: kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373)
[12:46:05] <wikibugs>	 (03PS8) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373)
[12:47:02] <wikibugs>	 (03PS3) 10Brouberol: kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373)
[12:47:02] <wikibugs>	 (03PS9) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373)
[12:47:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[12:48:31] <wikibugs>	 (03PS2) 10Btullis: Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943)
[12:49:54] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1236.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:50:02] <wikibugs>	 (03PS10) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373)
[12:50:05] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7086/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[12:51:19] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1236.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:54:16] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1235.eqiad.wmnet with reason: host reimage
[12:56:09] <wikibugs>	 (03PS9) 10Brouberol: kafka-mirrormaker: define business logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373)
[12:56:12] <wikibugs>	 (03PS4) 10Brouberol: kafka-mirrormaker: initial scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192109 (https://phabricator.wikimedia.org/T304373)
[12:57:55] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye
[12:59:05] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1235.eqiad.wmnet with reason: host reimage
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1300).
[13:00:05] <jouncebot>	 MatmaRex, xSavitar, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <Lucas_WMDE>	 o/
[13:00:17] <xSavitar>	 o/
[13:00:20] <Lucas_WMDE>	 I can deploy!
[13:00:21] <MatmaRex>	 hi
[13:00:25] <MatmaRex>	 thanks Lucas_WMDE :)
[13:00:30] <xSavitar>	 Lucas_WMDE okay sir 🙏🏽, thanks
[13:00:41] <Lucas_WMDE>	 xSavitar: do you want to self-service your deployment?
[13:00:57] <MatmaRex>	 oh, you've improved my backport note, thanks
[13:00:58] <xSavitar>	 Lucas_WMDE, you can deploy it, I'll test :)
[13:01:02] <Lucas_WMDE>	 ok
[13:01:05] <Lucas_WMDE>	 let’s start with that
[13:01:10] <Lucas_WMDE>	 and run gate-and-submit for the backport in the meantime
[13:01:11] <xSavitar>	 Okay
[13:01:13] <Lucas_WMDE>	 MatmaRex: yeah :)
[13:01:21] <Lucas_WMDE>	 thought it’d be useful in case someone else ended up running it ^^
[13:01:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[13:03:01] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:03:26] <Lucas_WMDE>	 feel like that config change is taking longer than usual to merge o_O
[13:03:31] <Lucas_WMDE>	 what is tox doing https://integration.wikimedia.org/ci/job/operations-mw-config-tox/9024/console
[13:03:31] <wikibugs>	 (03Merged) 10jenkins-bot: session: Enable MultiBackendSessionStore on `group0` wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[13:03:40] <xSavitar>	 Lucas_WMDE maybe you spoke too soon?
[13:03:51] <xSavitar>	 :)
[13:03:52] <Lucas_WMDE>	 not really, that was still longer than I would expect ^^
[13:03:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1183132|session: Enable MultiBackendSessionStore on `group0` wikis (T402808)]]
[13:04:00] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:04:37] <Lucas_WMDE>	 e.g. on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1191514 that build took 39s instead of 1m44s
[13:04:47] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:05:00] <xSavitar>	 Hm! Interesting... looking...
[13:06:15] <xSavitar>	 Lucas_WMDE not sure what is the issue but if it persists maybe we can file a Phab task or ask around if something has changed recently (since last week)?
[13:06:16] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] FixRenameUserLocalLogs: Improve matching for users renamed multiple times (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:07:46] <Lucas_WMDE>	 xSavitar: I think it’s fine to just leave it for now
[13:07:57] <xSavitar>	 Could it also be because this is the first deploy for the week? :)
[13:08:00] <Lucas_WMDE>	 the other recent builds at https://integration.wikimedia.org/ci/job/operations-mw-config-tox/ were all faster
[13:08:38] * xSavitar will hunt down first deploys for the week and see if there are any clues.
[13:09:19] <xSavitar>	 Lucas_WMDE, Ack! I just looked at random patches last week and they seem to run for about 40s max
[13:09:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1183132|session: Enable MultiBackendSessionStore on `group0` wikis (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:09:49] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:10:03] <xSavitar>	 Should I test?
[13:10:35] <Lucas_WMDE>	 yes please :)
[13:10:43] <xSavitar>	 Ack! Testing now...
[13:12:10] <wikibugs>	 (03CR) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve matching for users renamed multiple times (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:13:12] <wikibugs>	 (03CR) 10Ssingh: "Looks good but can also be removed from modules/profile/data/profile/installserver/preseed.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[13:13:42] <wikibugs>	 (03CR) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve matching for users renamed multiple times (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:13:48] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "retracting +2 in case we want to change something" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:14:19] <xSavitar>	 Tested so far on mediawikiwiki, officewiki and testwikidatawiki and everything seems to work fine.
[13:14:23] <xSavitar>	 Lucas_WMDE, you can sync, thank you
[13:14:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Continuing with sync
[13:14:38] <Lucas_WMDE>	 thanks!
[13:14:40] <MatmaRex>	 Lucas_WMDE: i guess we can backport the followup too, thanks for spotting that
[13:14:45] <Lucas_WMDE>	 ok, sounds good
[13:15:15] <Lucas_WMDE>	 MatmaRex: do you have someone around who can CR+2 the follow-up on master?
[13:15:20] <Lucas_WMDE>	 or should I be brave and do it? ^^
[13:15:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[13:15:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[13:15:44] <MatmaRex>	 xSavitar, possibly ;) can you have a look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1192128 ?
[13:16:16] <xSavitar>	 MatmaRex looking...
[13:17:53] <xSavitar>	 Responding to Lucas' comment otherwise looks fine.
[13:18:02] <xSavitar>	 *Responded
[13:18:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[13:18:44] <MatmaRex>	 sure, done
[13:18:46] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[13:21:07] <Lucas_WMDE>	 I quickly added Bug: T398177 to the commit message before xSavitar +2s it ;)
[13:21:08] <stashbot>	 T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177
[13:21:25] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1236.eqiad.wmnet with OS bullseye
[13:21:39] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Ensure field subquery returns just 1 result [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192130 (https://phabricator.wikimedia.org/T398177)
[13:21:39] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[13:21:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1183132|session: Enable MultiBackendSessionStore on `group0` wikis (T402808)]] (duration: 17m 52s)
[13:21:51] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[13:22:02] <xSavitar>	 Lucas_WMDE, Ack!
[13:22:03] <MatmaRex>	 thanks. and that's the backport
[13:22:15] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[13:22:16] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1235.eqiad.wmnet with OS bullseye
[13:23:22] <Lucas_WMDE>	 oh, and I guess we should backport that second one to wmf.21 also
[13:23:23] <Lucas_WMDE>	 wait
[13:23:29] <Lucas_WMDE>	 no. hasn’t been branched yet ^^
[13:23:47] <xSavitar>	 Lucas_WMDE, right, it hasn't been branched yet: https://versions.toolforge.org/
[13:23:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:23:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192130 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:24:07] <Lucas_WMDE>	 I suspect the first one will merge very quickly thanks to the CI success result cache
[13:24:15] <Lucas_WMDE>	 the second one will still need a full gate-and-submit though
[13:25:29] <xSavitar>	 Lucas_WMDE, wait, so you mean when gate-and-submit runs, the results get cached and even after it gets interrupted, then retriggered, it doesn't do a full run? Nice. What if the patch changed in the meantime?
[13:25:53] <wikibugs>	 (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Improve matching for users renamed multiple times [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:25:59] <xSavitar>	 ^^
[13:26:09] <Lucas_WMDE>	 the cache is only used if it’s the same Git commit being tested
[13:26:23] <Lucas_WMDE>	 both in the repo to which the test belongs and also in all dependent repositories, I believe
[13:26:33] <xSavitar>	 Nice, that's a neat feature. Kudos to the CI/CD lords around here.
[13:26:35] <Lucas_WMDE>	 so no the master branch it’s quite rare to see a cache hit afaik
[13:26:39] <Lucas_WMDE>	 *on the master branch
[13:26:50] <Lucas_WMDE>	 because by the time you try the gate-and-submit again, something else probably got merged already
[13:26:51] <xSavitar>	 Ack
[13:26:53] <Lucas_WMDE>	 but it’s useful for backport branches
[13:26:56] <Lucas_WMDE>	 https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/KTP34HIR5D66QLGHC3ZAIZKQWE46O5F4/ was the announcement
[13:27:27] <xSavitar>	 thanks for the link. Will read.
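Note on the cache behaviour described above: as Lucas_WMDE explains it, a cached success is reused only when the same Git commit is under test, in the tested repository and (he believes) in all dependent repositories. The sketch below illustrates that idea only; it is not the actual Wikimedia CI implementation. The names `cache_key` and `ResultCache`, the SHA-256 key scheme, and the commit IDs are all hypothetical.

```python
import hashlib


def cache_key(job_name: str, commits: dict[str, str]) -> str:
    """Hypothetical key: the job name plus the exact commit of every repo involved."""
    # Sort by repo name so the key does not depend on dictionary ordering.
    parts = [job_name] + [f"{repo}@{sha}" for repo, sha in sorted(commits.items())]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


class ResultCache:
    """Remembers which (job, commits) combinations have already passed."""

    def __init__(self) -> None:
        self._passed: set[str] = set()

    def record_success(self, job_name: str, commits: dict[str, str]) -> None:
        self._passed.add(cache_key(job_name, commits))

    def is_cached_success(self, job_name: str, commits: dict[str, str]) -> bool:
        return cache_key(job_name, commits) in self._passed


cache = ResultCache()

# Made-up commit IDs for a backport branch where nothing else merges in between.
backport = {"extensions/CentralAuth": "aaa111", "core": "bbb222"}
cache.record_success("gate-and-submit", backport)
print(cache.is_cached_success("gate-and-submit", backport))  # True: identical commits

# On master, a dependency has usually moved by the time of the retry.
moved = dict(backport, core="ccc333")
print(cache.is_cached_success("gate-and-submit", moved))  # False: one commit differs
```

This matches the point made in the channel: on a quiet backport branch the retry sees identical commits and can hit the cache, while on master something else has usually merged first, so the key changes and a full run is needed.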
[13:27:35] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye
[13:29:02] <Lucas_WMDE>	 I’ll do my deployment-charts change in parallel, should have no effect on each other
[13:29:10] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720) (owner: 10Lucas Werkmeister (WMDE))
[13:30:49] <wikibugs>	 (03PS10) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246)
[13:31:11] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720) (owner: 10Lucas Werkmeister (WMDE))
[13:31:13] <MatmaRex>	 i need to step away for a bit, i'll be back in 15 minutes or so
[13:31:19] <Lucas_WMDE>	 ok
[13:33:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:33:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[13:34:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[13:34:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[13:35:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[13:35:13] <wikibugs>	 (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Ensure field subquery returns just 1 result [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192130 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński)
[13:35:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1191495|FixRenameUserLocalLogs: Improve matching for users renamed multiple times (T398177)]], [[gerrit:1192130|FixRenameUserLocalLogs: Ensure field subquery returns just 1 result (T398177)]]
[13:35:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[13:35:44] <stashbot>	 T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177
[13:36:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[13:36:41] <Lucas_WMDE>	 xSavitar, while MatmaRex is afk: there’s nothing to test on mwdebug for those two backports, right?
[13:36:45] <Lucas_WMDE>	 since they only affect a maintenance script
[13:38:25] <xSavitar>	 Yes sir!
[13:38:48] <xSavitar>	 MatmaRex plans to do a dry run, investigate the output and then kick the script again afterwards
[13:38:58] <xSavitar>	 So, I think you can sync
[13:40:02] <Lucas_WMDE>	 yeah, I’ll just do the dry run after the sync is done
[13:40:11] <xSavitar>	 Okay
[13:40:35] * Lucas_WMDE is done with the deployment-charts kubernetes deploy ftr
[13:41:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1191495|FixRenameUserLocalLogs: Improve matching for users renamed multiple times (T398177)]], [[gerrit:1192130|FixRenameUserLocalLogs: Ensure field subquery returns just 1 result (T398177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:41:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:42] <stashbot>	 T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177
[13:42:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 matmarex, lucaswerkmeister-wmde: Continuing with sync
[13:42:39] <logmsgbot>	 btullis@cumin1003 reimage (PID 502318) is awaiting input
[13:43:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:44:51] <wikibugs>	 (03PS3) 10Btullis: Configure an-launcher1003 with its role, but absent job timers [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943)
[13:46:09] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7087/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192107 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[13:46:48] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[13:47:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191495|FixRenameUserLocalLogs: Improve matching for users renamed multiple times (T398177)]], [[gerrit:1192130|FixRenameUserLocalLogs: Ensure field subquery returns just 1 result (T398177)]] (duration: 11m 24s)
[13:47:09] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11224627 (10MatthewVernon) Hi @VRiley-WMF do you think you'll be able to do these swaps this week, please?
[13:47:11] <stashbot>	 T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177
[13:48:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist sul CentralAuth:FixRenameUserLocalLogs --logwiki=metawiki  # T398177 (dry run)
[13:48:22] <Lucas_WMDE>	 !log UTC afternoon backport+config window done (CentralAuth:FixRenameUserLocalLogs maintenance script will keep running for a few hours)
[13:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:34] <Lucas_WMDE>	 awight: ^ if you wanted to deploy something
[13:48:38] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "Do we need to vendor these modules in the operator chart at all?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[13:48:42] <xSavitar>	 Lucas_WMDE, thank you very much for deploying.
[13:49:36] <Lucas_WMDE>	 yw :)
[13:51:08] <awight>	 Lucas_WMDE: are you doing the deployment-charts now, and if so may I sneak in a few minutes of mw maintenance script run?
[13:51:57] <Lucas_WMDE>	 I already did the deployment-charts
[13:52:04] <Lucas_WMDE>	 I’m running a maintenance script but I assume you can run another one
[13:52:14] <Lucas_WMDE>	 (the one for T398177 will take some more hours)
[13:52:15] <stashbot>	 T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177
[13:53:04] <awight>	 Lucas_WMDE: okay I'll go ahead and try that.  Just let me know if it seems to cause problems.  I think this will take 10-60s, to purge 500 or so pages.
[13:54:03] <Lucas_WMDE>	 sounds good
[13:55:03] * MatmaRex back
[13:55:29] <MatmaRex>	 thanks Lucas_WMDE
[13:55:57] <Lucas_WMDE>	 np
[13:57:22] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ms-be[1086-1088].eqiad.wmnet with reason: awaiting controller swap
[13:57:34] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11224717 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f05e1660-c13c-4689-a96d-eaccf6967088) set by mvernon@cu...
[13:58:08] <logmsgbot>	 !log awight@deploy2002 mwscript-k8s job started: purgePage.php --wiki=dewiki  # T389363
[13:58:14] <stashbot>	 T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363
[13:59:06] <awight>	 Lucas_WMDE: All done, good luck with your longer run!
[14:00:25] <Lucas_WMDE>	 thanks!
[14:02:55] <wikibugs>	 06SRE, 06Traffic: "Backend fetch failed" on edit save - https://phabricator.wikimedia.org/T382790#11224731 (10ssingh) 05Open→03Resolved a:03ssingh This has been open for a while and there hasn't been any follow up from either side. @MGChecker: Please re-open if this issue still persists for you. Thanks!
[14:05:55] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] mirrormaker: add alerts directly in Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli)
[14:06:46] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1236.eqiad.wmnet with OS bullseye
[14:07:14] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye
[14:07:18] <Lucas_WMDE>	 MatmaRex: FWIW, so far I think I’m seeing the same number of “User has existed, but no local log entry” output rows
[14:07:33] <Lucas_WMDE>	 but the three “More than one matching local log entry for global” ones from abwiki went away
[14:07:46] <Lucas_WMDE>	 actually, the very last “User has existed, but no local log entry for global #49887933” line on abwiki went away too
[14:07:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917 (10Maria_Lechner_WMDE) 03NEW
[14:08:11] <MatmaRex>	 nice
[14:09:05] <wikibugs>	 (03PS18) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[14:09:20] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#11224783 (10ssingh) There has been no follow-up on this after Jun 2024. @MatthewVernon: should we keep this open?
[14:09:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405843#11224785 (10phaultfinder)
[14:10:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[14:10:37] <wikibugs>	 (03PS1) 10Marco Fossati: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259)
[14:11:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati)
[14:11:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#11224790 (10ssingh) 05Open→03Resolved a:03ssingh We have made progress in T301605, and specific to this task, we ramped up tra...
[14:13:40] <Lucas_WMDE>	 there’s still some “More than one matching local log entry for global” though, amwiki has two
[14:14:17] <wikibugs>	 (03CR) 10Joal: Replace old sqoop wiki list file with new autoupdated file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu)
[14:14:51] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[14:14:52] <Lucas_WMDE>	 MatmaRex: “amwiki Would update performer for local #101754 based on global #59763297 from 'Nahomnata' to 'J ansari'”
[14:15:03] <Lucas_WMDE>	 does that sound right? I thought the second part of https://phabricator.wikimedia.org/T398177#11146083 meant these shouldn’t have happened 🤔
[14:15:17] <Lucas_WMDE>	 (feel free to wait with the answer until it’s done and I’ve posted the full logs, of course :P)
[14:15:35] <Lucas_WMDE>	 (there’s one other “would update” in amwiki, from <INVALID>)
[14:16:01] <MatmaRex>	 in a meeting right now, i'll look later
[14:17:03] <wikibugs>	 (03CR) 10Phuedx: "@kharlan@wikimedia.org: Yes. I'm a little unsure as to why you're doing this here rather than in the ConfirmEdit extension but I'm out of " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[14:18:48] <wikibugs>	 (03CR) 10Reedy: "Avoiding putting WMF specific stuff (assumptions etc) into a bundled/tarballed extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[14:18:52] <Lucas_WMDE>	 ok :)
[14:19:52] <wikibugs>	 (03CR) 10Reedy: WIP hCaptcha: Enable A/B test for frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[14:21:13] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1236.eqiad.wmnet with reason: host reimage
[14:21:36] <wikibugs>	 (03CR) 10Kosta Harlan: "Yes, what @reedy@wikimedia.org said -- this is WMF specific, so the logic belongs here (for lack of better place, see also T401939)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[14:21:48] <wikibugs>	 (03PS1) 10Scott French: haproxy acl naming refactor and minor UI improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192145
[14:23:33] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.discovery.service-route check toolhub: maintenance
[14:23:33] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check toolhub: maintenance
[14:23:59] <wikibugs>	 (03PS3) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239)
[14:24:11] <wikibugs>	 (03PS6) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801)
[14:24:11] <wikibugs>	 (03PS1) 10Stevemunene: remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801)
[14:24:17] <wikibugs>	 (03PS1) 10Kosta Harlan: Hooks: Enable overriding the hook instance per action [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192148 (https://phabricator.wikimedia.org/T405239)
[14:24:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[14:24:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[14:24:51] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:24:54] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1236.eqiad.wmnet with reason: host reimage
[14:25:07] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Actually sorry, my bad. site.pp can be updated as well if you want but +1." [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[14:26:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11224867 (10WMDE-leszek) I approve this request on WMDE's end. Thank you!
[14:26:11] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "Tested locally at 5a6390f" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192145 (owner: 10Scott French)
[14:26:18] <wikibugs>	 (03CR) 10Stevemunene: "No worries, I added another patch for that I3bca4291156dec81bb03d37eb66d1a9a5aa3cab4." [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[14:27:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[14:28:25] <wikibugs>	 (03PS2) 10Brouberol: Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1430)
[14:30:53] <wikibugs>	 (03CR) 10Scott French: [V:03+2 C:03+2] haproxy acl naming refactor and minor UI improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1192145 (owner: 10Scott French)
[14:31:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192148 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[14:33:37] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002"
[14:33:39] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002
[14:34:33] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002
[14:34:35] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy haproxy acl naming refactor and minor UI improvements - swfrench@cumin2002"
[14:34:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11224891 (10Jclark-ctr) @VRiley-WMF   T404103 optics have arrived and are or cart between rows C/D.   Please connect all the cables you preran and update CableIDs in Netbox https://netbox.wiki...
[14:35:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11224893 (10Jclark-ctr) @cmooney  all fibers for  ssw1-d1-eqiad  have been connected  except cr1-eqiad ,ssw1-e1-eqiad  ,ssw1-f1-eqiad
[14:38:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno)
[14:39:16] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#11224904 (10MatthewVernon) I guess not, if it has recurred, it's not been enough to page...
[14:39:58] <wikibugs>	 (03PS19) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[14:40:11] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#11224908 (10ssingh) 05Open→03Resolved a:03ssingh OK thank you. I am marking this as resolved for now. We can re-open as required.
[14:41:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[14:43:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11224919 (10Jclark-ctr) a:05BTullis→03Jclark-ctr
[14:45:06] <wikibugs>	 (03PS11) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373)
[14:45:51] <wikibugs>	 (03PS20) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[14:45:55] <wikibugs>	 (03PS12) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373)
[14:46:59] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[14:47:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[14:48:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11224950 (10Jelto)
[14:49:10] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11224955 (10Jelto)
[14:50:03] <logmsgbot>	 btullis@cumin1003 reimage (PID 505620) is awaiting input
[14:51:13] <wikibugs>	 (03CR) 10CDanis: [C:03+1] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene)
[14:51:32] <wikibugs>	 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11224972 (10elukey) p:05Triage→03Medium
[14:51:53] <wikibugs>	 (03CR) 10CDanis: [C:03+1] admin/data: add the analytics-wikidata system user and user groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene)
[14:52:11] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11224976 (10elukey) p:05Triage→03Low
[14:52:24] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11224978 (10elukey)
[14:55:43] <wikibugs>	 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11224991 (10JTweed-WMF)
[14:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:55] <icinga-wm>	 RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1245) taken on 2025-09-29 13:24:43 (1904 GiB, +2.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:02:11] <logmsgbot>	 !log stevemunene@puppetserver1001 conftool action : set/pooled=no; selector: service=(druid-public-broker),name=druid1007.eqiad.wmnet
[15:02:30] <logmsgbot>	 !log stevemunene@puppetserver1001 conftool action : set/pooled=no; selector: service=(druid-public-broker),name=druid1008.eqiad.wmnet
[15:03:29] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:03:42] <wikibugs>	 (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: define the namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192118 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:04:34] <wikibugs>	 06SRE, 06serviceops, 06Traffic: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800#11225029 (10ssingh) This is being moved on the Traffic workboard to "Radar/Not for service" as I don't think there is anything on our end to do here. Please let me know if you...
[15:04:42] <wikibugs>	 (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: initial scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192109 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:04:54] <wikibugs>	 06SRE, 06serviceops, 06Traffic: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800#11225030 (10ssingh) And to be clear, by that I mean that this change is better suited for MW and not the CDN.
[15:06:16] <wikibugs>	 (03CR) 10Btullis: [C:03+1] kafka-mirrormaker: define business logic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192110 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:07:53] <wikibugs>	 (03CR) 10Btullis: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:08:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11225041 (10Jhancock.wm) Update on the cp2056. Finally got Dell to agree to send a replacement card after a week of back of forth and escalations. So that shoul...
[15:08:39] <icinga-wm>	 PROBLEM - Druid broker on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:08:39] <icinga-wm>	 PROBLEM - Druid coordinator on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:08:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:08:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:08:41] <icinga-wm>	 PROBLEM - Druid overlord on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:09:05] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[15:09:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:16] <sukhe>	 ^ stevemunene is removing druid1007-8
[15:09:17] <sukhe>	 so this might be that
[15:10:21] <wikibugs>	 (03CR) 10Brouberol: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:10:39] <icinga-wm>	 PROBLEM - Druid overlord on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:10:39] <icinga-wm>	 PROBLEM - Druid historical on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:10:39] <icinga-wm>	 PROBLEM - Druid broker on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:10:39] <icinga-wm>	 PROBLEM - Druid coordinator on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:10:41] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:11:34] <stevemunene>	 Druid errors are from decommissioning druid services on druid100[7-8] for T403801
[15:11:35] <stashbot>	 T403801: decommission druid100[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T403801
[15:13:03] <wikibugs>	 (03CR) 10Ahmon Dancy: "The addition of the profile::puppetserver::volatile::cdn_private_git_token lookup has broken puppet on deployment-puppetserver-1.deploymen" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede)
[15:14:32] <wikibugs>	 (03PS13) 10Brouberol: kafka-mirrormaker: define helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373)
[15:14:35] <wikibugs>	 (03CR) 10Brouberol: kafka-mirrormaker: define helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192111 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:16:30] <wikibugs>	 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11225074 (10dancy) Noting that puppet on `deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud` is currently broken due to the addition of the `profile::puppetserver::volatile::cdn_private_git_token` look...
[15:17:02] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 10Spicerack: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11225075 (10Scott_French) 05Open→03Resolved a:03Scott_French Amazing - thank you very much,...
[15:19:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1192117 (https://phabricator.wikimedia.org/T304373) (owner: 10Brouberol)
[15:22:12] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: automatically figure out some values to reduce release config size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol)
[15:24:31] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values
[15:24:33] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s)
[15:24:39] <icinga-wm>	 RECOVERY - Druid coordinator on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:24:39] <icinga-wm>	 RECOVERY - Druid broker on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:24:39] <icinga-wm>	 RECOVERY - Druid historical on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:24:39] <icinga-wm>	 RECOVERY - Druid overlord on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:24:41] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:26:01] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values
[15:26:04] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s)
[15:26:40] <logmsgbot>	 !log tappof@deploy2002 Started restart [performance/navtiming@94fa387]: Add authenticated mw_context values
[15:27:19] <wikibugs>	 (03PS21) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[15:28:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[15:29:12] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07Wikimedia-Performance-recommendation: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911#11225132 (10ssingh) What is the update on this, given that it has been a while and I am a bit confused reading the text and trying to...
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1530).
[15:33:35] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values
[15:33:37] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s)
[15:34:03] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11225146 (10elukey) We have configured Tegola and Kartotherian in prod-codfw to use the new postgres stack, but I am seeing some errors in Kartotherian like the following:  ` {"name":...
[15:34:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:39] <icinga-wm>	 RECOVERY - Druid coordinator on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:35:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:35:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:35:39] <icinga-wm>	 RECOVERY - Druid broker on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:35:41] <icinga-wm>	 RECOVERY - Druid overlord on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:36:17] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@94fa387]: Add authenticated mw_context values
[15:36:19] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@94fa387]: Add authenticated mw_context values (duration: 00m 02s)
[15:36:45] <wikibugs>	 06SRE, 06Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431#11225153 (10ssingh) 05Open→03Resolved a:03ssingh We have had this for a while and the responses are padded. Marking as resolved.
[15:37:54] <wikibugs>	 (03PS22) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[15:38:21] <wikibugs>	 06SRE, 06Traffic: Performance implications of buffer sizes in Apache Traffic Server intercept plugins - https://phabricator.wikimedia.org/T287847#11225162 (10ssingh) 05Open→03Resolved a:03ssingh This was merged upstream in 9.2.x so we have inherited this change. Since we have not revisited this since...
[15:38:30] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11225165 (10elukey) I applied `/usr/local/bin/maps-grants-gis.sql` on maps2011 and now the grants are better:  ` gis=# SELECT * FROM information_schema.role_table_grants where grantee...
[15:39:10] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:39:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[15:39:28] <wikibugs>	 (03PS23) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[15:40:04] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.211.0" for 168 host(s)
[15:40:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[15:41:59] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:42:02] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s)
[15:43:02] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:43:05] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s)
[15:44:03] <logmsgbot>	 !log tappof@deploy2002 Started restart [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:44:08] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.211.0" completed for 168 hosts
[15:44:10] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 07Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#11225214 (10ssingh) I am curious: should we keep this open or should this be resolved now given that we have...
[15:44:49] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "@cdanis@wikimedia.org: I guess we should merge this today; let me know and happy to take care of that." [puppet] - 10https://gerrit.wikimedia.org/r/1191703 (owner: 10CDanis)
[15:44:51] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:44:56] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: 10CDanis)
[15:45:00] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:45:02] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s)
[15:45:21] <logmsgbot>	 !log dancy@deploy2002 Started scap sync-world: Testing gitinfo fix (T405738)
[15:45:27] <stashbot>	 T405738: Debug scap partial deployment, 25 Sept 2025 - https://phabricator.wikimedia.org/T405738
[15:46:31] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:46:41] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 15s)
[15:48:02] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11225236 (10ABran-WMF) a:03ABran-WMF
[15:49:13] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Very sad that nowadays we see this huge amount of yaml, but we cannot really do anything differently :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis)
[15:51:28] <logmsgbot>	 !log tappof@deploy2002 Started deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:51:30] <logmsgbot>	 !log tappof@deploy2002 Finished deploy [performance/navtiming@578b1d3]: Add authenticated mw_context values (duration: 00m 02s)
[15:51:37] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225259 (10VRiley-WMF) Hey @MatthewVernon Yes, I am planning on doing this today. I apologize as I was out for two days last week.
[15:52:26] <logmsgbot>	 !log tappof@deploy2002 Started restart [performance/navtiming@578b1d3]: Add authenticated mw_context values
[15:52:29] <cdanis>	 sukhe: please feel free to merge both of those patches if you like :D
[15:52:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11225279 (10Papaul)
[15:52:33] <cdanis>	 I can also get around to it soon, in a meeting now
[15:53:14] <sukhe>	 cdanis: happy to take care of them
[15:53:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] taskgen: add haproxy Lua tests [puppet] - 10https://gerrit.wikimedia.org/r/1191703 (owner: 10CDanis)
[15:54:24] <fabfur>	 !log restart haproxy on cp5021 to test utf8ps converter
[15:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:42] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: 10CDanis)
[15:55:27] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5021.eqsin.wmnet
[15:56:23] <wikibugs>	 (03CR) 10Ssingh: "Thanks for reporting the broken CI, unrelated to this change. @cdanis@wikimedia.org fixed this and that change has been merged. Can you pl" [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert)
[15:56:37] <logmsgbot>	 !log dancy@deploy2002 Finished scap sync-world: Testing gitinfo fix (T405738) (duration: 11m 16s)
[15:56:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:56:44] <stashbot>	 T405738: Debug scap partial deployment, 25 Sept 2025 - https://phabricator.wikimedia.org/T405738
[15:56:57] <wikibugs>	 (03PS2) 10Jelto: Update eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845)
[15:57:10] <wikibugs>	 (03PS3) 10Jelto: Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703)
[15:57:27] <wikibugs>	 (03PS2) 10Clément Goubert: taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845)
[15:57:38] <wikibugs>	 (03PS5) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859)
[15:57:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:58:31] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet
[15:59:38] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225331 (10MatthewVernon) Cool, thanks :)
[16:00:42] <wikibugs>	 (03CR) 10Clément Goubert: "Thank you both for fixing it!" [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert)
[16:00:46] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "I didn't find anything strange, it was a lot of yaml but I didn't find:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis)
[16:02:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:02:46] <wikibugs>	 (03PS4) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239)
[16:04:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11225371 (10elukey) @Jhancock.wm me and Jesse are running out of ideas, if you have time could you please open the host and check if the bus between the BMC and the motherboard et...
[16:05:18] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[16:05:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet wi...
[16:06:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:06:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:07:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11225393 (10Papaul) I had a meeting today with @Jgreen about the new switch configuration. what we will be doing is to move the...
[16:07:20] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048']
[16:07:29] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048']
[16:09:21] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048']
[16:09:37] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048']
[16:09:51] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:11:13] <wikibugs>	 (03PS24) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[16:11:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048']
[16:11:37] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048']
[16:12:20] <wikibugs>	 (03PS25) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[16:13:44] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:13:50] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:14:00] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:14:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[16:14:37] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:16:13] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405843#11225482 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are 6 servers on the EOL list in this rack. removing thresholds and adding to tracking task
[16:18:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405755#11225505 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are at least 6 servers in this rack that are on the EOL list. Removing alerting and addin...
[16:20:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11225508 (10Jhancock.wm)
[16:23:47] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:25:13] <wikibugs>	 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11225571 (10MatthewVernon) @elukey re the triage priority - if there's a problem with our standard UEFI setup for re-imaging ms* node...
[16:25:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11225573 (10elukey) The firmware cookbook doesn't work yet since spicerack is configured to look for a `HttpPushUri` field in the Redfish's UpdateService endpoi...
[16:28:07] <wikibugs>	 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11225583 (10elukey) >>! In T404356#11225571, @MatthewVernon wrote: > @elukey re the triage priority - if there's a problem with our s...
[16:28:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:29:04] <wikibugs>	 10ops-eqsin: Inbound errors on interface cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://phabricator.wikimedia.org/T405938 (10phaultfinder) 03NEW
[16:29:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:35:38] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940 (10RobH) 03NEW
[16:36:15] <wikibugs>	 (03PS26) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[16:36:38] <wikibugs>	 (03PS1) 10Papaul: Add new Frack switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1192171 (https://phabricator.wikimedia.org/T405618)
[16:36:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:36:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:36:46] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11225644 (10RobH) a:03LSobanski @lsobanski,  I'm not exactly sure who in your team should be the point of contact for the migration of these hosts (list...
[16:37:50] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 (10RobH) 03NEW
[16:38:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[16:39:02] <wikibugs>	 (03PS2) 10Papaul: Add new Frack switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1192171 (https://phabricator.wikimedia.org/T405618)
[16:39:45] <wikibugs>	 (03CR) 10Harroyo-wmf: [C:03+1] hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[16:39:58] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11225664 (10RobH) a:03KOfori @kofori,  I'm assigning this to you as team manager for feedback on who I should work with as the point of contact for the migration of...
[16:41:06] <wikibugs>	 (03PS2) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502
[16:41:12] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11225673 (10ssingh) Note that @KOfori is out, this should be directed to @Kappakayala in the meantime.
[16:41:18] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add new Frack switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1192171 (https://phabricator.wikimedia.org/T405618) (owner: 10Papaul)
[16:41:46] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943 (10RobH) 03NEW
[16:42:19] <wikibugs>	 (03PS3) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502
[16:42:30] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:42:49] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:44:34] <wikibugs>	 (03PS4) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502
[16:45:36] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:46:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945 (10RobH) 03NEW
[16:46:43] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:48:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11225731 (10RobH) a:05RobH→03joanna_borun Joanna,  I'm not exactly sure who on your team to assign this as point of contact, so I'm assigning...
[16:49:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946 (10RobH) 03NEW
[16:49:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11225747 (10CDanis) a:05joanna_borun→03LSobanski
[16:51:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11225769 (10RobH) a:03herron @herron or @colewhite  (not sure which of you is best to handle this, please reassign as needed!)  I'm looking to get some feedback for the s...
[16:51:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11225772 (10RobH)
[16:52:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948 (10RobH) 03NEW
[16:53:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey)
[16:53:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225795 (10RobH) a:05RobH→03Gehel @gehel,  I'm not sure who would be the best point of contact within Search SRE to coordinate with for the migration of the above...
[16:56:02] <wikibugs>	 (03CR) 10Bearloga: hCaptcha: Enable A/B test for frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[16:57:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:58:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[16:59:45] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11225841 (10RobH) a:03BTullis @btullis,  After asking Guillaume he said I should work with you as point of contact for these migrations (though that you would still b...
[17:00:04] <jouncebot>	 swfrench-wmf and dancy: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1700). Please do the needful.
[17:00:04] <jouncebot>	 ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T1700).
[17:00:16] <dancy>	 o/
[17:00:41] <swfrench-wmf>	 o/
[17:00:44] <dancy>	 swfrench-wmf: The new release of scap is ready to deploy
[17:01:09] <swfrench-wmf>	 great, I'm running some pre-flight checks to make sure there aren't any latent diffs that will get in our way
[17:01:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225848 (10RobH) a:05Gehel→03bking After irc chat with @gehel he suggested this should assign over to @bking for coordination (but it will still be discussed with...
[17:01:18] <wikibugs>	 06SRE, 13Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460#11225852 (10Krinkle)
[17:02:10] <swfrench-wmf>	 dancy: looks like we're good to go. feel free to go ahead and deploy scap, then we can run a stop-before-sync deploy as discussed.
[17:02:43] <dancy>	 OK.
[17:03:06] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.212.0" for 2 host(s)
[17:03:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 (10RobH) 03NEW
[17:04:29] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Discovery-Search: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225872 (10RobH)
[17:04:55] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.212.0" completed for 2 hosts
[17:05:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:05:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:05:56] <logmsgbot>	 !log dancy@deploy2002 Started scap sync-world: Testing T405110
[17:06:03] <stashbot>	 T405110: Allow the same namespace name to be used in different clusters - https://phabricator.wikimedia.org/T405110
[17:06:43] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+1] ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati)
[17:06:44] <logmsgbot>	 !log dancy@deploy2002 Stopping before sync operations
[17:07:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11225895 (10RobH) @Kappakayala,  I'm not exactly sure who in your team would be the best point of contact for the above migration list, as it covers multiple service groups.  Th...
[17:07:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11225896 (10RobH) a:03Kappakayala
[17:08:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225899 (10Jhancock.wm) @Papaul Hey we've gotten the preseed and site.pp files configured correctly as far as i can tell but still getting this o...
[17:08:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225900 (10Jhancock.wm) a:05Jhancock.wm→03Papaul
[17:08:29] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11225901 (10RobH)
[17:09:48] <swfrench-wmf>	 dancy: I'm looking at the updated contents of `/etc/helmfile-defaults/mediawiki/release` and I think this looks good
[17:09:57] <dancy>	 Agreed
[17:10:22] <swfrench-wmf>	 alright, I'll merge your helmfile patch, and then run the diffs again
[17:10:30] <dancy>	 OK. Standing by
[17:10:35] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki services: Update path to scap-created yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189905 (https://phabricator.wikimedia.org/T405110) (owner: 10Ahmon Dancy)
[17:10:45] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:13:16] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki services: Update path to scap-created yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189905 (https://phabricator.wikimedia.org/T405110) (owner: 10Ahmon Dancy)
[17:13:38] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: support environment in release values file name [puppet] - 10https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110) (owner: 10Scott French)
[17:14:36] <swfrench-wmf>	 papaul: good to merge your hieradata changes?
[17:14:39] <wikibugs>	 (03PS1) 10Dzahn: zuul: move new zuul nodepool setup to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938)
[17:15:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] zuul: move new zuul nodepool setup to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[17:15:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11225919 (10Jclark-ctr) Your dispatch shipped on 9/29/2025 11:56 AM
[17:16:31] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225920 (10VRiley-WMF) 05Open→03In progress Starting work on ms-be1087 (will get to ms-be1086 in a bit. starting with the cage...
[17:16:36] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:17:26] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[17:17:32] <wikibugs>	 (03PS2) 10Dzahn: zuul: move new zuul nodepool setup to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938)
[17:17:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11225925 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with O...
[17:18:58] <logmsgbot>	 !log dancy@deploy2002 Started scap sync-world: Testing T405110 (v2)
[17:19:05] <stashbot>	 T405110: Allow the same namespace name to be used in different clusters - https://phabricator.wikimedia.org/T405110
[17:26:18] <logmsgbot>	 !log dancy@deploy2002 Finished scap sync-world: Testing T405110 (v2) (duration: 07m 20s)
[17:26:25] <stashbot>	 T405110: Allow the same namespace name to be used in different clusters - https://phabricator.wikimedia.org/T405110
[17:27:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:28:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:28:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11225940 (10Maria_Lechner_WMDE) I have not signed an NDA yet, I'm happy to receive the respective form/doc at maria.lechner AT wikimedia DOT de.
[17:33:36] <wikibugs>	 (03CR) 10Eric Gardner: [C:03+1] ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati)
[17:34:02] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:35:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:35:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:36:44] <swfrench-wmf>	 dancy: alright, mwscript-k8s works as expected - I think we're done here :)
[17:36:59] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:37:03] <dancy>	 woohoo!  Thanks for testing, swfrench-wmf
[17:37:28] <swfrench-wmf>	 thanks for making this happen! :)
[17:41:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:02] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225981 (10VRiley-WMF)
[17:42:40] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11225983 (10VRiley-WMF) Finished updating ms-be1087, moving onto ms-be1088
[17:48:03] <wikibugs>	 (03PS1) 10Brouberol: opensearch-operator-crds: add a crds.yaml fixture file to point the CI to the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192185 (https://phabricator.wikimedia.org/T397246)
[17:49:08] <wikibugs>	 (03PS11) 10Krinkle: varnish: Enable unified mobile routing on wikimedia.org wikis [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[17:49:38] <wikibugs>	 (03CR) 10Superpes15: "Ack! many thanks :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15)
[17:49:49] <wikibugs>	 (03CR) 10Btullis: [C:03+1] opensearch-operator-crds: add a crds.yaml fixture file to point the CI to the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192185 (https://phabricator.wikimedia.org/T397246) (owner: 10Brouberol)
[17:50:45] <wikibugs>	 (03PS17) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[17:51:57] <wikibugs>	 (03CR) 10Bearloga: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati)
[17:53:42] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch-operator-crds: add a crds.yaml fixture file to point the CI to the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192185 (https://phabricator.wikimedia.org/T397246) (owner: 10Brouberol)
[17:54:35] <wikibugs>	 (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt)
[17:55:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[17:56:28] <wikibugs>	 (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt)
[17:56:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:00:02] <wikibugs>	 (03CR) 10Bking: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[18:00:26] <wikibugs>	 (03PS27) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[18:02:05] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: automatically figure out some values to reduce release config size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol)
[18:02:09] <wikibugs>	 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11226050 (10Prototyperspective) Videos used to start quickly and to load quickly. Since a short while they aren't anymore. Maybe I should ask on a Commons board whether other users also have...
[18:04:29] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: automatically figure out some values to reduce release config size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191072 (https://phabricator.wikimedia.org/T405485) (owner: 10Brouberol)
[18:05:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:05:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:08:27] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "Tests are all passing except for an unexpected broken test introduced in I8553991e419f604585d812db2ce66c9a05a4e764" [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[18:11:56] <wikibugs>	 (03CR) 10Brouberol: opensearch-cluster: Add chart for review (3/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[18:12:14] <wikibugs>	 (03PS2) 10Cappybaraa: diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207)
[18:14:51] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[18:21:15] <wikibugs>	 (03CR) 10Brouberol: "Does a cluster expose prometheus metrics? Do we scrape them?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[18:22:20] <wikibugs>	 (03CR) 10Aaron Schulz: "Interesting that this is opt-out. I get that these CSP headers are used restbase compatibility and perhaps some other non-MW endpoints tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) (owner: 10Hnowlan)
[18:23:07] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm
[18:23:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11226087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm
[18:24:14] <wikibugs>	 (03PS3) 10Andrea Denisse: mediawiki-engineering: Add API Gateway alerts with thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151)
[18:24:51] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:26:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:28:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:32:34] <wikibugs>	 (03PS9) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[18:33:02] <wikibugs>	 (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[18:34:42] <wikibugs>	 (03CR) 10Dr0ptp4kt: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati)
[18:35:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:35:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:38:58] <wikibugs>	 (03PS1) 10Dzahn: zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118)
[18:39:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[18:39:19] <logmsgbot>	 jhancock@cumin1002 reimage (PID 4182690) is awaiting input
[18:44:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[18:47:10] <wikibugs>	 (03PS1) 10Dzahn: move zuul nodepool to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938)
[18:47:24] <wikibugs>	 (03CR) 10Dzahn: "needs https://gerrit.wikimedia.org/r/1192200 to compile" [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[18:47:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[18:48:02] <wikibugs>	 (03PS2) 10Dzahn: move zuul nodepool to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938)
[18:48:08] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] move zuul nodepool to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[18:48:26] <wikibugs>	 (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[18:48:32] <wikibugs>	 (03PS3) 10Dzahn: move zuul nodepool user token to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938)
[18:48:47] <logmsgbot>	 !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1191504|Disable wmgUseMdotRouting on wikimedia.org wikis (group1) (T403510)]]
[18:48:54] <stashbot>	 T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[18:50:30] <MatmaRex>	 Lucas_WMDE: btw i finally looked at the log line you posted,
[18:50:33] <MatmaRex>	 [29 Sep 25 16:14] * Lucas_WMDE MatmaRex: “amwiki Would update performer for local #101754 based on global #59763297 from 'Nahomnata' to 'J ansari'”
[18:51:38] <MatmaRex>	 this is in fact correct – "Nahomnata" is not a renamer (and they have 0 edits), their account just by accident has the same ID on amwiki as "J ansari" has on metawiki
[18:54:21] <MatmaRex>	 (actor ID)
[18:55:23] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1191504|Disable wmgUseMdotRouting on wikimedia.org wikis (group1) (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:55:30] <stashbot>	 T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[18:55:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:55:54] <wikibugs>	 (03PS1) 10Elukey: role::maps::master: enable planet sync in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565)
[18:56:16] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] move zuul nodepool user token to new location for I745f8c87b4c57f [labs/private] - 10https://gerrit.wikimedia.org/r/1192200 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[18:57:32] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Continuing with sync
[18:57:35] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "@mmuhlenhoff@wikimedia.org I am re-enabling osm import on maps2009 to allow it to catch up over night, we'd need a codfw stack in two days" [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[18:57:42] <wikibugs>	 (03CR) 10Elukey: "@mmuhlenhoff@wikimedia.org I am re-enabling osm import on maps2009 to allow it to catch up over night, we'd need a codfw stack in two days" [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[18:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:58:32] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::maps::master: enable planet sync in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1192205 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[18:58:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[18:59:25] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958 (10RobH) 03NEW
[18:59:57] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11226273 (10RobH) a:03MatthewVernon @MatthewVernon,   Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia....
[19:01:12] <wikibugs>	 (03CR) 10Phuedx: "Understood. Many thanks for the explanation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[19:01:51] <wikibugs>	 (03CR) 10Eric Gardner: ReaderExperiments' ImageBrowsing: don't collect the HTTP user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192138 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati)
[19:02:37] <logmsgbot>	 !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191504|Disable wmgUseMdotRouting on wikimedia.org wikis (group1) (T403510)]] (duration: 13m 50s)
[19:02:44] <stashbot>	 T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[19:02:46] <wikibugs>	 (03PS1) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209
[19:05:11] <wikibugs>	 (03PS1) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165)
[19:06:05] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7110/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh)
[19:06:26] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1192179/7108/" [puppet] - 10https://gerrit.wikimedia.org/r/1192179 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[19:06:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:06:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:07:19] <wikibugs>	 (03PS2) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165)
[19:08:11] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7111/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh)
[19:08:41] <wikibugs>	 (03CR) 10Dzahn: "just a thought. since official MW releases come from releases.wikimedia.org you could also consider moving the GPG key there" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh)
[19:09:56] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Yeah it's a good point and perhaps we should add that as well. Specifically in this case, the CR is a response to T405165." [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh)
[19:10:21] <wikibugs>	 (03PS1) 10Scott French: deployment_server: add mw-script/next tracking PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955)
[19:10:21] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the reviews, Reuven." [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:10:22] <wikibugs>	 (03PS1) 10Scott French: deployment_server: enable support for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192203 (https://phabricator.wikimedia.org/T405955)
[19:11:07] <wikibugs>	 (03PS3) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165)
[19:12:10] <wikibugs>	 (03PS4) 10Ssingh: P:cache::haproxy: exempt mediawiki.org and /keys from UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165)
[19:13:04] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7112/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh)
[19:20:21] <wikibugs>	 (03PS1) 10Dzahn: zuul: follow-up fix to moving nodepool config to own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192215 (https://phabricator.wikimedia.org/T395938)
[19:25:23] <wikibugs>	 (03CR) 10BCornwall: "I agree with Daniel - fewer exceptions to handle!" [puppet] - 10https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: 10Ssingh)
[19:25:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:26:26] <wikibugs>	 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964 (10RobH) 03NEW
[19:26:49] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226439 (10VRiley-WMF)
[19:26:58] <wikibugs>	 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11226455 (10RobH)
[19:27:21] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226456 (10VRiley-WMF) moving onto ms-be1086
[19:27:38] <wikibugs>	 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11226459 (10RobH) a:03bking @bking,   Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and...
[19:27:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:28:56] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul: follow-up fix to moving nodepool config to own profile [puppet] - 10https://gerrit.wikimedia.org/r/1192215 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[19:30:00] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] deployment_server: add mw-script/next tracking PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:30:08] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1192203 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:30:45] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966 (10RobH) 03NEW
[19:31:17] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11226497 (10RobH) a:03bking @bking,   Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and...
[19:32:05] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11226505 (10RobH)
[19:33:59] <wikibugs>	 (03PS5) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239)
[19:34:05] <wikibugs>	 (03CR) 10Kosta Harlan: hCaptcha: Enable A/B test for frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[19:35:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:35:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:36:14] <swfrench-wmf>	 jouncebot: nowandnext
[19:36:14] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[19:36:14] <jouncebot>	 In 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2000)
[19:37:34] <swfrench-wmf>	 unless there are any objections, I might merge a puppet patch shortly that requires a follow-on no-sync (i.e., no deploy) scap run
[19:38:33] <wikibugs>	 (03PS28) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[19:38:42] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: add mw-script/next tracking PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192202 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:41:50] * swfrench-wmf is running puppet-agent
[19:42:19] <swfrench-wmf>	 I'll be running scap in ~ 5 minutes
[19:44:04] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11226602 (10Jhancock.wm) got the pxe issue fixed. but found a new one. @Clement_Goubert this server has to be uefi and it looks like the preseed is set up for bios. if i'm reading...
[19:44:13] <wikibugs>	 (03PS2) 10Dzahn: zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118)
[19:44:51] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:48:32] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Non-deploy scap run to initialize mw-script/next helmfile-defaults values - T405955
[19:48:41] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[19:48:48] <logmsgbot>	 !log swfrench@deploy2002 Stopping before sync operations
[19:49:48] * swfrench-wmf is done
[19:50:38] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226618 (10VRiley-WMF) 05In progress→03Open
[19:51:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226637 (10VRiley-WMF) These are all done! Will wait for the next two. Thanks @MatthewVernon
[19:51:31] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11226648 (10VRiley-WMF) These are all done! Will wait for the next two. Thanks @MatthewVernon
[19:52:09] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1192195/7114/zuul1001.eqiad.wmnet/change.zuul1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn)
[19:54:23] <wikibugs>	 (03PS3) 10Dzahn: zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118)
[19:57:13] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: enable support for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192203 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[19:57:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:57:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[19:57:44] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1192195/7117/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn)
[19:57:48] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul: create systemd unit for zuul scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1192195 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2000)
[20:00:04] <jouncebot>	 lucaswerkmeister and sergi0: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP
[20:00:18] <sergi0>	 o/
[20:00:21] <lucaswerkmeister>	 o/
[20:00:28] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: WIP
[20:04:25] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm
[20:04:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11226748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm executed with errors: -...
[20:04:53] <sergi0>	 lucaswerkmeister: are you self-deploying?
[20:05:29] <lucaswerkmeister>	 preferably not, as it’s Lucas_WMDE who has deployment rights, not my volunteer self ^^
[20:05:36] <lucaswerkmeister>	 but I guess if nobody else is around…
[20:05:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:05:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:05:44] <sergi0>	 Alright, I can deploy then
[20:06:28] <lucaswerkmeister>	 cool, thanks!
[20:08:16] <wikibugs>	 (03PS1) 10CDanis: puppetserver::volatile: Default to no XCheeseScore [puppet] - 10https://gerrit.wikimedia.org/r/1192224
[20:08:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191804 (https://phabricator.wikimedia.org/T405830) (owner: 10Lucas Werkmeister)
[20:09:00] <wikibugs>	 (03PS3) 10Andrew Bogott: P:openstack: nova: Drop obsolete settings [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah)
[20:09:03] * lucaswerkmeister tries to put a test case together
[20:09:09] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah)
[20:09:14] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191804 (https://phabricator.wikimedia.org/T405830) (owner: 10Lucas Werkmeister)
[20:09:32] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul2001.codfw.wmnet with reason: WIP
[20:09:37] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1191804|Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata (T405830)]]
[20:09:43] <stashbot>	 T405830: Enable USERLANGUAGE magic word for Wikidata - https://phabricator.wikimedia.org/T405830
[20:09:44] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: WIP
[20:09:51] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:03] <lucaswerkmeister>	 ok I should be able to test it at https://www.wikidata.org/wiki/User:Lucas_Werkmeister/sandbox?uselang=de once it’s on mwdebug
[20:11:28] <wikibugs>	 (03PS29) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[20:11:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] P:openstack: nova: Drop obsolete settings [puppet] - 10https://gerrit.wikimedia.org/r/1189394 (owner: 10Majavah)
[20:14:27] <wikibugs>	 (03CR) 10Bking: "Yes, the cluster exposes metrics at `_prometheus/metrics` on the primary port (9200). I added some annotations in the last patchset to exp" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[20:16:29] <logmsgbot>	 !log sgimeno@deploy2002 lucaswerkmeister, sgimeno: Backport for [[gerrit:1191804|Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata (T405830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:16:35] <stashbot>	 T405830: Enable USERLANGUAGE magic word for Wikidata - https://phabricator.wikimedia.org/T405830
[20:16:48] <sergi0>	 lucaswerkmeister: please test
[20:17:09] <lucaswerkmeister>	 it works \o/
[20:17:16] <lucaswerkmeister>	 after a purge, https://www.wikidata.org/wiki/User:Lucas_Werkmeister/sandbox?uselang=de says German instead of English
[20:17:37] <sergi0>	 great, syncing
[20:17:53] <logmsgbot>	 !log sgimeno@deploy2002 lucaswerkmeister, sgimeno: Continuing with sync
[20:18:00] <lucaswerkmeister>	 thanks!
[20:18:09] <sergi0>	 yw
[20:22:37] <wikibugs>	 (03PS30) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[20:23:00] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191804|Enable $wgParserEnableUserLanguage ({{USERLANGUAGE}}) on Wikidata (T405830)]] (duration: 13m 23s)
[20:23:07] <stashbot>	 T405830: Enable USERLANGUAGE magic word for Wikidata - https://phabricator.wikimedia.org/T405830
[20:23:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno)
[20:24:46] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: enable new notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190703 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno)
[20:25:05] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1190703|Growth: enable new notifications (T404085)]]
[20:25:11] <stashbot>	 T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085
[20:25:40] <wikibugs>	 (03CR) 10Bking: opensearch-cluster: Add chart for review (3/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[20:26:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:26:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:29:51] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973 (10phaultfinder) 03NEW
[20:32:25] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1190703|Growth: enable new notifications (T404085)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:32:31] <stashbot>	 T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085
[20:33:47] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[20:33:57] <wikibugs>	 (03PS1) 10BCornwall: Remove wikimedia_trust ACLs from varnish/haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688)
[20:36:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:36:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:38:50] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190703|Growth: enable new notifications (T404085)]] (duration: 13m 45s)
[20:38:55] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis)
[20:38:57] <stashbot>	 T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085
[20:39:35] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Import the upstream spark-operator chart version 2.2.1 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis)
[20:40:21] <wikibugs>	 (03Merged) 10jenkins-bot: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis)
[20:40:26] <sergi0>	 !log end of UTC late backport window
[20:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:00] <wikibugs>	 (03Merged) 10jenkins-bot: Import the upstream spark-operator chart version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis)
[20:42:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw netbox cable cleanup - https://phabricator.wikimedia.org/T402535#11226865 (10Jhancock.wm) a:03Jhancock.wm
[20:42:47] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7123/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) (owner: 10BCornwall)
[20:43:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11226867 (10Papaul)
[20:44:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11226869 (10Papaul) p:05Triage→03Medium
[20:55:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:56:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[20:57:43] <wikibugs>	 (03PS31) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246)
[20:59:27] <wikibugs>	 (03CR) 10Cappybaraa: "Portal is already added to core-Namespaces.php, I checked and it does not need changes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2100).
[21:01:53] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[21:01:55] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1236.eqiad.wmnet with OS bullseye
[21:05:39] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:05:39] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:05:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11226930 (10BTullis) p:05Triage→03High
[21:09:57] <wikibugs>	 (03PS1) 10BCornwall: wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952)
[21:10:32] <rzl>	 if the security window isn't in use today, I might deploy some envoy upgrades
[21:13:36] <wikibugs>	 (03PS2) 10BCornwall: wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952)
[21:14:48] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: shift old full graph hosts to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772)
[21:16:13] <wikibugs>	 (03CR) 10Bking: [C:03+1] "nit: a couple of the hosts are changing roles to internal-scholarly and scholarly (not just internal-main)" [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper)
[21:16:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11226958 (10Papaul)
[21:16:55] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs: shift old full graph hosts to new roles [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772)
[21:17:35] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: shift old full graph hosts to new roles [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper)
[21:24:41] <wikibugs>	 (03PS1) 10Btullis: Add 28 new hadoop workers to the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1192239 (https://phabricator.wikimedia.org/T398438)
[21:27:39] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:27:39] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:28:02] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs2016.codfw.wmnet with OS bullseye
[21:28:13] <wikibugs>	 (03PS1) 10Scott French: deployment_server: switch next and migration releases to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1192227 (https://phabricator.wikimedia.org/T405955)
[21:28:14] <wikibugs>	 (03PS1) 10Scott French: trafficserver: enable PHP_ENGINE next routing [puppet] - 10https://gerrit.wikimedia.org/r/1192228 (https://phabricator.wikimedia.org/T405955)
[21:31:22] <wikibugs>	 (03CR) 10Bking: opensearch-cluster: Add chart for review (3/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[21:33:00] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 8 CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1192239 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[21:33:57] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye
[21:35:38] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:35:38] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[21:36:30] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[21:36:53] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1209-1236].eqiad.wmnet
[21:38:33] <wikibugs>	 (03Merged) 10jenkins-bot: mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[21:38:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11227045 (10VRiley-WMF) 05Open→03Resolved Replacement unit received and deployed. Contacted vendor multiple times regarding return of the damaged PDU, but no instructions/shipping label have been provided. A...
[21:39:33] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1209-1236].eqiad.wmnet
[21:40:54] <logmsgbot>	 !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1191522 T403663
[21:41:01] <stashbot>	 T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[21:41:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:42:38] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11227064 (10BTullis) a:03BTullis
[21:43:55] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11227070 (10BTullis) p:05Triage→03Low
[21:45:41] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage
[21:46:37] <logmsgbot>	 !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1191522 T403663 (duration: 06m 44s)
[21:46:44] <stashbot>	 T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663
[21:47:55] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11227080 (10BTullis) There are now 5 hosts showing this error: {F66711315} * an-worker1187 * an-...
[21:48:51] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage
[21:52:21] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980 (10RobH) 03NEW
[21:52:37] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227121 (10RobH)
[21:54:34] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227127 (10RobH) @jgreen,  Please note this host will still leverage BIOS capable booting and can be setup as such (you did not specify in the ordering task) but future generations sta...
[21:54:58] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227128 (10RobH)
[21:55:14] <wikibugs>	 (03CR) 10Btullis: "I wonder if we should revisit the reasons for choosing the opensearch-operator version 2.7.0." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[21:56:03] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11227129 (10RobH)
[21:57:24] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981 (10RobH) 03NEW
[21:57:59] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11227159 (10RobH)
[22:01:39] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982 (10RobH) 03NEW
[22:01:45] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983 (10RobH) 03NEW
[22:01:54] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982#11227198 (10RobH)
[22:02:07] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983#11227202 (10RobH)
[22:05:43] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[22:07:21] <wikibugs>	 (03Merged) 10jenkins-bot: mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus)
[22:14:12] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[22:14:16] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[22:14:51] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[22:17:41] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[22:17:48] <logmsgbot>	 !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[22:18:23] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Wikisource [puppet] - 10https://gerrit.wikimedia.org/r/1192246 (https://phabricator.wikimedia.org/T403510)
[22:18:25] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510)
[22:21:59] <wikibugs>	 (03CR) 10Bking: "I should have done a better job of documenting this, but the 2.8.0 chart is not compatible with OpenSearch 2.7.0 (see https://github.com/o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[22:24:51] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:25:45] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[22:27:32] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "as far as I can see it looks good to me" [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall)
[22:29:45] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[22:34:38] <wikibugs>	 06SRE, 13Patch-For-Review: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11227347 (10BCornwall) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180712 Seems to have broken varnish tests. Looking through seems to suggest this is because `profile::cache...
[22:35:45] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[22:35:45] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[22:38:43] <wikibugs>	 (03PS1) 10Btullis: Customise the login.html template of JupyterHub to hide the TLS warning [puppet] - 10https://gerrit.wikimedia.org/r/1192259 (https://phabricator.wikimedia.org/T403863)
[22:40:17] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7133/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192259 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis)
[22:54:18] <logmsgbot>	 !log bking@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2017.codfw.wmnet with OS bullseye
[22:54:45] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[22:55:37] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet']
[22:55:49] <logmsgbot>	 !log bking@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet']
[22:56:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet']
[22:56:35] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet']
[22:56:45] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[22:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250929T2300)
[23:05:45] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:05:45] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:06:20] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on Wikisource [puppet] - 10https://gerrit.wikimedia.org/r/1192246 (https://phabricator.wikimedia.org/T403510)
[23:06:20] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/1192247 (https://phabricator.wikimedia.org/T403510)
[23:06:20] <wikibugs>	 (03PS1) 10Krinkle: beta: Remove redundant enable_m_redir_except_regex setting [puppet] - 10https://gerrit.wikimedia.org/r/1192263 (https://phabricator.wikimedia.org/T403510)
[23:06:22] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/1192264 (https://phabricator.wikimedia.org/T403510)
[23:06:24] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510)
[23:06:27] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on fr.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[23:06:29] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on de.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192267 (https://phabricator.wikimedia.org/T403510)
[23:06:31] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on es.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[23:06:35] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on ru.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192269 (https://phabricator.wikimedia.org/T403510)
[23:06:38] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on ja.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192270 (https://phabricator.wikimedia.org/T403510)
[23:06:42] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[23:06:46] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510)
[23:15:04] <wikibugs>	 (03PS1) 10Krinkle: beta: Remove redundant enable_m_redir_except_regex setting [puppet] - 10https://gerrit.wikimedia.org/r/1192263 (https://phabricator.wikimedia.org/T403510)
[23:23:23] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192276 (https://phabricator.wikimedia.org/T403510)
[23:23:26] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192277 (https://phabricator.wikimedia.org/T403510)
[23:23:28] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192278 (https://phabricator.wikimedia.org/T403510)
[23:23:30] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510)
[23:25:45] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:27:45] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:35:45] <icinga-wm>	 RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:35:45] <icinga-wm>	 RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:37:52] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192281
[23:37:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192281 (owner: 10TrainBranchBot)
[23:44:51] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:46:22] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[23:46:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11227509 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet wi...
[23:49:38] <wikibugs>	 (03PS1) 10RLazarus: deployment_server: Prefix `helmfile apply` output with "[service env]" [puppet] - 10https://gerrit.wikimedia.org/r/1192282
[23:49:45] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2016 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:49:45] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:54:45] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192281 (owner: 10TrainBranchBot)
[23:54:51] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:56:45] <icinga-wm>	 PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:58:08] <wikibugs>	 (03PS1) 10BCornwall: Remove wikimedia.support from ncredir/acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/1192283 (https://phabricator.wikimedia.org/T400952)
[23:58:45] <icinga-wm>	 PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[23:59:10] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable