[00:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196542 (owner: 10TrainBranchBot) [00:04:07] any staff here who can help me [00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:26] 10SRE-swift-storage: File missing from four datacenters - https://phabricator.wikimedia.org/T407589#11283723 (10Bawolff) I tried doing a cache purge, which didn't change anything, so I suspect its missing from eqiad swift but is present in codfw swift. [00:07:11] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196780 [00:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196780 (owner: 10TrainBranchBot) [00:09:23] FIRING: [10x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:17] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1196780 (owner: 10TrainBranchBot) [00:38:58] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:01:07] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:15:12] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 04s) [01:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:11] FIRING: [10x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:09] 10SRE-swift-storage: File missing from four datacenters - https://phabricator.wikimedia.org/T407589#11283797 (101brianm7) I was the uploader. Hopefully this helps. After clicking the upload button on the File Wizard, it said "this may take a minute or two" or something along those lines, and so I tabbed out. Whe... 
[01:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:42:11] FIRING: [10x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:43] (03PS5) 10Scott French: P:cache::haproxy: start preparing for known-client DSL [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) [02:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:25:04] (03PS6) 10Scott French: P:cache::haproxy: start preparing for known-client DSL [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) [02:49:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:50:34] !incidents [02:50:35] 6880 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [02:50:35] 6879 (RESOLVED) ns1-v4/Auth DNS (paged) [02:50:35] 6878 (RESOLVED) Host db2187 (paged) [02:50:35] 6877 (RESOLVED) Host db2153 (paged) [02:50:35] 6875 (RESOLVED) NELHigh sre (thanos-rule 
tcp.timed_out) [02:50:36] 6871 (RESOLVED) [2x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [02:51:46] !ack 6880 [02:51:47] 6880 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [02:54:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [03:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:07] 06SRE, 06cloud-services-team: latested Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11283890 (10Andrew) (feel free to retry with that host, it's not doing much at the moment other than being broken) [04:21:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:22:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:22:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:26:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:27:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:31:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:32:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:32:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:36:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:37:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:43:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:57:37] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11283915 (10Marostegui) 05Open→03Resolved >>! In T393042#11282408, @Jhancock.wm wrote: > @Marostegui you have to root into the server and view the console to see what the install... 
[05:02:34] (03PS2) 10Marostegui: mariadb: Define mariadb packages for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) [05:02:35] (03PS1) 10Marostegui: es1056: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1196788 (https://phabricator.wikimedia.org/T406488) [05:03:25] (03CR) 10Marostegui: [C:03+2] es1056: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1196788 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:05:15] (03CR) 10Marostegui: [C:03+2] mariadb: Define mariadb packages for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1196628 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [05:08:27] (03PS1) 10Marostegui: instances.yaml: Add es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1196789 (https://phabricator.wikimedia.org/T406488) [05:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:02] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1196789 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:11:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1056 to dbctl T406488', diff saved to https://phabricator.wikimedia.org/P84044 and previous config saved to /var/cache/conftool/dbconfig/20251017-051114-marostegui.json [05:11:19] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [05:11:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84045 and previous config saved to /var/cache/conftool/dbconfig/20251017-051121-root.json [05:26:28] !log marostegui@cumin1003 dbctl commit (dc=all): 
'es1056 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84046 and previous config saved to /var/cache/conftool/dbconfig/20251017-052627-root.json [05:26:32] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [05:26:54] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11283934 (10Tobi_WMDE_SW) >>! In T403298#11181575, @HShaikh wrote: > reading the exchange above. I feel like there is a need to get the IPs whitelisted and i can for... [05:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:38:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84047 and previous config saved to /var/cache/conftool/dbconfig/20251017-054133-root.json [05:41:38] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [05:42:11] 
FIRING: [10x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1027 T407595', diff saved to https://phabricator.wikimedia.org/P84048 and previous config saved to /var/cache/conftool/dbconfig/20251017-054458-marostegui.json [05:45:04] T407595: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595 [05:45:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es1027.eqiad.wmnet with reason: Cloning [05:46:02] (03PS1) 10Marostegui: es1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1196790 (https://phabricator.wikimedia.org/T407595) [05:46:38] (03CR) 10Marostegui: [C:03+2] es1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1196790 (https://phabricator.wikimedia.org/T407595) (owner: 10Marostegui) [05:56:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84049 and previous config saved to /var/cache/conftool/dbconfig/20251017-055639-root.json [05:56:44] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251017T0600) [06:11:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84050 and previous config saved to /var/cache/conftool/dbconfig/20251017-061145-root.json [06:11:51] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:16:04] (03PS1) 10Arnaudb: gerrit: 
unmask service & disable backup temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) [06:16:04] (03CR) 10Arnaudb: "to prepare for the next switchover, whether they are run across secondary or primary instance, we'll need to fully resync gerrit2003. That" [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [06:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:24:38] (03CR) 10Slyngshede: [C:03+1] Enable the Prometheus exporter for the Ganeti CA on Ganeti masters [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [06:26:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84051 and previous config saved to /var/cache/conftool/dbconfig/20251017-062651-root.json [06:26:56] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:41:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84052 and previous config saved to /var/cache/conftool/dbconfig/20251017-064157-root.json [06:42:02] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:43:16] (03PS2) 10Arnaudb: gerrit: stop stopping gerrit.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1196695 (https://phabricator.wikimedia.org/T387833) [06:44:05] (03CR) 10Arnaudb: [C:03+2] "thanks for the dig!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1196684 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [06:44:26] (03CR) 10Arnaudb: [C:03+2] gerrit: stop puppet across all instances [cookbooks] - 10https://gerrit.wikimedia.org/r/1196694 (https://phabricator.wikimedia.org/T407200) (owner: 10Arnaudb) [06:49:35] (03Merged) 10jenkins-bot: gerrit: rsync and chown fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1196684 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [06:50:34] (03Merged) 10jenkins-bot: gerrit: stop puppet across all instances [cookbooks] - 10https://gerrit.wikimedia.org/r/1196694 (https://phabricator.wikimedia.org/T407200) (owner: 10Arnaudb) [06:57:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84053 and previous config saved to /var/cache/conftool/dbconfig/20251017-065703-root.json [06:57:08] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [06:59:22] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] cloudceph: handle double / single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251017T0700) [07:05:08] (03PS1) 10Filippo Giunchedi: admin: add FIDO public key for filippo [puppet] - 10https://gerrit.wikimedia.org/r/1196795 [07:06:54] (03PS1) 10Slyngshede: data.yaml: Add an additional FIDO ssh key for slyngshede [puppet] - 10https://gerrit.wikimedia.org/r/1196796 [07:07:07] (03PS1) 10Filippo Giunchedi: hieradata: move cloudcephosd1051 to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1196798 (https://phabricator.wikimedia.org/T405478) [07:12:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84054 and previous config saved to /var/cache/conftool/dbconfig/20251017-071209-root.json [07:12:15] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:18:17] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11284101 (10Aklapper) [07:23:31] (03CR) 10Filippo Giunchedi: [C:03+2] "Host isn't doing anything yet, I'll just do it™" [puppet] - 10https://gerrit.wikimedia.org/r/1196798 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [07:27:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84055 and previous config saved to /var/cache/conftool/dbconfig/20251017-072715-root.json [07:27:20] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:29:08] 06SRE, 10Cloud-VPS, 06DC-Ops, 06cloud-services-team (FY2025/26-Q1), 13Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11284124 (10fgiunchedi) For the record, these are the effects of moving an host from double... 
[07:29:46] 06SRE, 10Cloud-VPS, 06DC-Ops, 06cloud-services-team (FY2025/26-Q1), 13Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11284125 (10fgiunchedi) There's some harmless error/race when setting mtu, though eventuall... [07:35:14] (03CR) 10Jelto: [C:03+1] "It should be fine to put this setting into `hieradata/cloud/eqiad1/gitlab-runners/common.yaml` and `hieradata/cloud/eqiad1/devtools/common" [puppet] - 10https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [07:36:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqdfw and NTT (128.242.179.181) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:42:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1056 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P84056 and previous config saved to /var/cache/conftool/dbconfig/20251017-074221-root.json [07:42:26] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [07:42:37] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1196695 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:42:55] (03CR) 10Arnaudb: [C:03+2] gerrit: stop stopping gerrit.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1196695 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:43:19] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11284156 (10elukey) Hi Matthew! Usually the first one that starts testing a new OS has also the daunting duty of figuring out missing packages, the Infra Foundatio... 
[07:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:47:28] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:47:48] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:49:14] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#11284163 (10Marostegui) I don't recall any - let's keep an eye on them though [07:49:31] (03Merged) 10jenkins-bot: gerrit: stop stopping gerrit.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1196695 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:50:42] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11284167 (10Joe) For python3-conftool and derived packages, you can simply patch the `.gitlab-ci.yml` file to generate debs also for trixie, and then get them from... 
[07:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:53:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:xe-0/1/1 (Transit: NTT (253065) {#11401}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:58:08] (03PS1) 10Elukey: services: move tegola and kartotherian's eqiad configs to the new stack [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196803 (https://phabricator.wikimedia.org/T381565) [08:04:52] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2032.codfw.wmnet onto es2055.codfw.wmnet [08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:35] (03PS1) 10Marostegui: mariadb: Productionize db2246 [puppet] - 10https://gerrit.wikimedia.org/r/1196827 (https://phabricator.wikimedia.org/T406551) [08:07:28] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:07:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:08:21] fceratto@cumin1003 clone_es (PID 4087945) is awaiting input [08:08:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:xe-0/1/1 (Transit: 
NTT (253065) {#11401}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:08:59] !log draining Arelion eqiad <-> codfw transport with OSPF metric and re-enabling port on cr1-eqiad [08:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:52] (03CR) 10Giuseppe Lavagetto: "This is how I originally wrote the code for x-provenance; we decided to use the variable across the board as they are:" [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [08:11:45] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2246 [puppet] - 10https://gerrit.wikimedia.org/r/1196827 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [08:12:11] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 transport flapping, disabled for now - https://phabricator.wikimedia.org/T407578#11284197 (10cmooney) Thanks Brandon you did the right thing. For now, for troubleshooting, I have set the Arelion circuit to 'drained' sta... 
[08:14:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:19:06] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:19:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2248.codfw.wmnet onto db2246.codfw.wmnet [08:19:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2248 - Depool db2248.codfw.wmnet to then clone it to db2246.codfw.wmnet - marostegui@cumin1003 [08:19:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2248 - Depool db2248.codfw.wmnet to then clone it to db2246.codfw.wmnet - marostegui@cumin1003 [08:22:13] (03PS1) 10Phuedx: MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) [08:23:19] fceratto@cumin1003 clone_es (PID 4087945) is awaiting input [08:24:49] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 transport flapping, disabled for now - https://phabricator.wikimedia.org/T407578#11284209 (10cmooney) Seems this started fairly suddenly yesterday afternoon: {F66756494 width=800} The link is flapping hard up/down cons... 
[08:30:56] (03PS1) 10Marostegui: installserver: Do not format es1050 [puppet] - 10https://gerrit.wikimedia.org/r/1196860 [08:35:39] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1050 [puppet] - 10https://gerrit.wikimedia.org/r/1196860 (owner: 10Marostegui) [08:40:50] (03PS4) 10JavierMonton: topic: Adding javiermonton to `analytics-admins` and `deployment` groups. [puppet] - 10https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) [08:43:15] (03CR) 10Clément Goubert: [C:03+2] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [08:43:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:44:51] (03Merged) 10jenkins-bot: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [08:46:29] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [08:47:03] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [08:48:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11284263 (10Gehel) [08:49:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key also verified out of band." 
[puppet] - https://gerrit.wikimedia.org/r/1196796 (owner: Slyngshede)
[08:50:49] SRE-SLO, observability, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11284327 (Gehel)
[08:51:13] sre-alert-triage, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11284345 (Gehel)
[08:52:00] SRE, Infrastructure-Foundations, netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11284373 (cmooney)
[08:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:53:43] (CR) FNegri: "Thanks @dzahn@wikimedia.org and @jwodstrcil@wikimedia.org -- I'll merge this patch as it is, and let Jelto handle the `devtools` config in" [puppet] - https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: FNegri)
[08:54:05] (CR) FNegri: [C:+2] hiera: gitlab::runner::docker set MTU to 1450 [puppet] - https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: FNegri)
[08:54:25] SRE, envoy, serviceops, Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work: Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11284450 (Gehel)
[08:54:43] (CR) Jelto: [C:+1] "sounds good!
thank you" [puppet] - https://gerrit.wikimedia.org/r/1196493 (https://phabricator.wikimedia.org/T405742) (owner: FNegri)
[09:02:11] SRE, Infrastructure-Foundations, netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11284488 (cmooney) The link has been mostly stable since re-enabling it at 08:15 UTC, it flapped a few times immediat...
[09:05:41] (PS16) Btullis: Pin the version of opensearch wherever it is installed [puppet] - https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199)
[09:05:42] (PS16) Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199)
[09:05:42] (PS15) Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199)
[09:05:50] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605 (Virginie.caplet) NEW
[09:07:18] SRE, SRE-Access-Requests: Enroll Jeltos YubiKey for production access - https://phabricator.wikimedia.org/T407606 (Jelto) NEW
[09:08:16] (PS1) Jelto: admin: add FIDO YubiKey key to jelto [puppet] - https://gerrit.wikimedia.org/r/1196865 (https://phabricator.wikimedia.org/T407606)
[09:08:35] (CR) Muehlenhoff: [C:+1] "Patch looks good and validated out of band" [puppet] - https://gerrit.wikimedia.org/r/1196795 (owner: Filippo Giunchedi)
[09:08:38] SRE, GitLab: OpenSSH 10.1+ warns that Wikimedia SSH does not use post-quantum key exchange algorithm - https://phabricator.wikimedia.org/T407557#11284517 (Lucas_Werkmeister_WMDE)
[09:09:54] (PS1) Marostegui: Revert "mariadb: Define mariadb packages for trixie" [puppet] - https://gerrit.wikimedia.org/r/1196866
[09:10:54] SRE, GitLab: OpenSSH 10.1+ warns that Wikimedia SSH does not use post-quantum key exchange algorithm - https://phabricator.wikimedia.org/T407557#11284532 (Lucas_Werkmeister_WMDE) >>! In T407557#11282964, @LucasWerkmeister wrote: > `lang=shell-session > me@host operations-puppet $ git grep curve25519-sha2...
[09:12:04] (CR) Marostegui: [C:+2] Revert "mariadb: Define mariadb packages for trixie" [puppet] - https://gerrit.wikimedia.org/r/1196866 (owner: Marostegui)
[09:13:29] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7294/co" [puppet] - https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: Btullis)
[09:13:50] SRE, GitLab: OpenSSH 10.1+ warns that Wikimedia SSH does not use post-quantum key exchange algorithm - https://phabricator.wikimedia.org/T407557#11284538 (cmooney) p:Triage→Low I'm not really sure this is a massive issue right now. It's not clear to me that ssh sessions logs from now will be hug...
[09:23:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:23:49] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:11] (CR) Muehlenhoff: [C:+1] "Patch looks good and validated out of band" [puppet] - https://gerrit.wikimedia.org/r/1196865 (https://phabricator.wikimedia.org/T407606) (owner: Jelto)
[09:27:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:xe-0/1/1 (Transit: NTT (253065) {#11401}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:01] (CR) Jelto: [C:+2] admin: add FIDO YubiKey key to jelto [puppet] - https://gerrit.wikimedia.org/r/1196865 (https://phabricator.wikimedia.org/T407606) (owner: Jelto)
[09:33:22] (PS1) Clément Goubert: Revert "Route old /api/rest_v1/?specs endpoints to static JSON files" [deployment-charts] - https://gerrit.wikimedia.org/r/1196869
[09:33:29] (CR) Clément Goubert: [C:+2] Revert "Route old /api/rest_v1/?specs endpoints to static JSON files" [deployment-charts] - https://gerrit.wikimedia.org/r/1196869 (owner: Clément Goubert)
[09:34:58] (Merged) jenkins-bot: Revert "Route old /api/rest_v1/?specs endpoints to static JSON files" [deployment-charts] - https://gerrit.wikimedia.org/r/1196869 (owner: Clément Goubert)
[09:35:06] FIRING:
SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:36:29] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:36:48] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[09:36:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:37:00] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[09:37:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:xe-0/1/1 (Transit: NTT (253065) {#11401}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:40:21] (CR) Filippo Giunchedi: [C:+2] admin: add FIDO public key for filippo [puppet] - https://gerrit.wikimedia.org/r/1196795 (owner: Filippo Giunchedi)
[09:42:11] FIRING: [10x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:21] SRE, Infrastructure-Foundations, netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] -
https://phabricator.wikimedia.org/T407578#11284632 (cmooney) So there was a known fault on the Arelion side and they had raised a ticket internally about it....
[10:02:14] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:02:30] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:03:19] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:03:33] !log un-draining Arelion 100G transport eqiad <-> codfw following carrier fibre fix and return to stability T407578
[10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:38] T407578: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578
[10:03:57] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:05:07] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:05:25] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:05:30] (PS1) Clément Goubert: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203)
[10:10:36] SRE, GitLab: OpenSSH 10.1+ warns that Wikimedia SSH does not use post-quantum key exchange algorithm - https://phabricator.wikimedia.org/T407557#11284644 (Ladsgroup) (Wrong Mortiz)
[10:16:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: Sergio Gimeno)
[10:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:22:42] (PS2) Clément Goubert: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203)
[10:25:40] (CR) Hnowlan: Route old /api/rest_v1/?specs endpoints to static JSON files (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203) (owner: Clément Goubert)
[10:27:10] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605#11284686 (Ladsgroup)
[10:27:15] (PS3) Clément Goubert: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203)
[10:29:57] (CR) Clément Goubert: Route old /api/rest_v1/?specs endpoints to static JSON files (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1196875
(https://phabricator.wikimedia.org/T397203) (owner: Clément Goubert)
[10:31:58] (CR) Hnowlan: [C:+1] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203) (owner: Clément Goubert)
[10:32:16] (CR) Clément Goubert: [C:+2] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203) (owner: Clément Goubert)
[10:32:23] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605#11284692 (Ladsgroup) @KFrancis Hi, Would you mind starting the process of NDA on file for Virginie Caplet (WMDE)?
[10:34:12] (Merged) jenkins-bot: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203) (owner: Clément Goubert)
[10:34:35] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:35:13] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:35:59] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:36:09] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:43:52] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:44:24] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:49:59] (CR) Muehlenhoff: [C:+1] "Looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/1196803 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[10:50:34] (PS1) Esanders: Follow-up I6698875: Set insert-ignore on all insert queries [extensions/Flow] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196884
(https://phabricator.wikimedia.org/T407357)
[10:51:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Flow] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: Esanders)
[10:52:31] sre-alert-triage, serviceops: Alert in need of triage: ProbeDown (instance proxoid:4260) - https://phabricator.wikimedia.org/T407615 (LSobanski) NEW
[10:52:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2248 gradually with 4 steps - Pool db2248.codfw.wmnet in after cloning
[10:52:48] sre-alert-triage, serviceops: Alert in need of triage: ProbeDown (instance proxoid:4260) - https://phabricator.wikimedia.org/T407615#11284774 (LSobanski) The alert is also firing for codfw.
[10:59:24] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-eqdfw and NTT (128.242.179.181) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251017T0700)
[11:00:05] jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251017T1100).
[11:00:40] (CR) CI reject: [V:-1] Follow-up I6698875: Set insert-ignore on all insert queries [extensions/Flow] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: Esanders)
[11:06:09] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:06:21] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:08:29] (PS1) Jcrespo: cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380)
[11:10:08] (PS2) Jcrespo: cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380)
[11:11:15] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:11:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:11:41] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:24:58] (PS5) JavierMonton: topic: Adding javiermonton to `analytics-admins` and `deployment` groups. [puppet] - https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187)
[11:25:06] (CR) Ladsgroup: [C:+2] topic: Adding javiermonton to `analytics-admins` and `deployment` groups. [puppet] - https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) (owner: JavierMonton)
[11:25:07] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11284845 (Ladsgroup)
[11:25:08] (CR) Ladsgroup: [V:+2 C:+2] topic: Adding javiermonton to `analytics-admins` and `deployment` groups.
[puppet] - https://gerrit.wikimedia.org/r/1196010 (https://phabricator.wikimedia.org/T407187) (owner: JavierMonton)
[11:27:35] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[11:27:57] SRE, SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11284849 (Ladsgroup)
[11:28:27] SRE, SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11284850 (Ladsgroup)
[11:28:51] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for JavierMonton - https://phabricator.wikimedia.org/T407187#11284852 (Ladsgroup) Open→Resolved a:Ladsgroup
[11:32:06] SRE, SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11284856 (Ladsgroup)
[11:33:26] SRE, SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11284858 (Ladsgroup) There is no NDA on file from what I'm seeing. @KFrancis Hi, would you mind starting the process of NDA for Sean Leong too? As part of WMDE staff.
[11:38:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2248 gradually with 4 steps - Pool db2248.codfw.wmnet in after cloning
[11:38:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2248.codfw.wmnet onto db2246.codfw.wmnet
[11:42:34] (PS1) Marostegui: packages_wmf,packages_client.pp: Add trixie [puppet] - https://gerrit.wikimedia.org/r/1196893 (https://phabricator.wikimedia.org/T407472)
[11:43:48] (PS1) Ladsgroup: admin: Add neslihanturan to restricted [puppet] - https://gerrit.wikimedia.org/r/1196894 (https://phabricator.wikimedia.org/T406590)
[11:44:35] (CR) CI reject: [V:-1] admin: Add neslihanturan to restricted [puppet] - https://gerrit.wikimedia.org/r/1196894 (https://phabricator.wikimedia.org/T406590) (owner: Ladsgroup)
[11:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:45:56] (CR) Ladsgroup: [C:-2] "I have not confirmed the ssh key (both WMCS and oob)" [puppet] - https://gerrit.wikimedia.org/r/1196894 (https://phabricator.wikimedia.org/T406590) (owner: Ladsgroup)
[11:46:31] (CR) Ladsgroup: [C:+1] packages_wmf,packages_client.pp: Add trixie [puppet] - https://gerrit.wikimedia.org/r/1196893 (https://phabricator.wikimedia.org/T407472) (owner: Marostegui)
[11:46:41] (CR) Marostegui: [C:+2] packages_wmf,packages_client.pp: Add trixie [puppet] - https://gerrit.wikimedia.org/r/1196893 (https://phabricator.wikimedia.org/T407472) (owner: Marostegui)
[11:47:09] (PS2) Ladsgroup: admin: Add neslihanturan to restricted [puppet] - https://gerrit.wikimedia.org/r/1196894 (https://phabricator.wikimedia.org/T406590)
[11:51:29] (Abandoned) Brouberol:
opensearch-operator: install the operator via admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1189822 (https://phabricator.wikimedia.org/T404906) (owner: Brouberol)
[11:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:52:48] (PS3) Jcrespo: cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380)
[11:59:28] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:54] (PS1) MVernon: Update mvernon ssh key [puppet] - https://gerrit.wikimedia.org/r/1196897
[12:06:19] (PS1) Marostegui: db1195: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196898 (https://phabricator.wikimedia.org/T407463)
[12:06:54] (CR) Marostegui: [C:+2] db1195: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1196898 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[12:07:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1195.eqiad.wmnet with reason: Maintenance
[12:07:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1195 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84062 and previous config saved to
/var/cache/conftool/dbconfig/20251017-120737-marostegui.json
[12:12:26] (PS2) MVernon: Update mvernon ssh key [puppet] - https://gerrit.wikimedia.org/r/1196897
[12:12:45] (CR) Ladsgroup: [C:+2] Update mvernon ssh key [puppet] - https://gerrit.wikimedia.org/r/1196897 (owner: MVernon)
[12:12:47] (CR) Ladsgroup: [V:+2 C:+2] Update mvernon ssh key [puppet] - https://gerrit.wikimedia.org/r/1196897 (owner: MVernon)
[12:13:02] (CR) Jcrespo: [C:+1] "Let's yolo the migration and revert if it doesn't work!" [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[12:15:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84063 and previous config saved to /var/cache/conftool/dbconfig/20251017-121548-root.json
[12:16:23] (CR) Marostegui: "All grants are in place I assume?" [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[12:17:42] (PS1) Cathal Mooney: netops: add new BGP group names to CoreBGPDwon alert [alerts] - https://gerrit.wikimedia.org/r/1196900 (https://phabricator.wikimedia.org/T405558)
[12:18:17] (CR) Elukey: "LGTM Thanks! I am also adding Riccardo to double check :)" [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[12:19:31] (CR) Jcrespo: [C:+1] "Remote backups does not require special grants, only local (logical) backups do."
[puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[12:21:54] (CR) Marostegui: [C:+1] cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
[12:30:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84064 and previous config saved to /var/cache/conftool/dbconfig/20251017-123054-root.json
[12:33:06] (CR) Esanders: "recheck" [extensions/Flow] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: Esanders)
[12:35:25] ops-eqiad, SRE, DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11285096 (Jclark-ctr) @Eevans are you able to reimage the server i have had no luck due to no root partition error. and preseed file has -efi for raid configuration for a server setup for legacy bios?
[12:43:17] SRE-swift-storage: File missing from four datacenters - https://phabricator.wikimedia.org/T407589#11285110 (Pigsonthewing) I re-uploaded the same image (that is, the image from the stated source) using the "Upload a new version of this file". That seems to have resolved the issue.
[12:43:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:46:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84066 and previous config saved to /var/cache/conftool/dbconfig/20251017-124600-root.json
[12:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:52:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[12:52:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[12:54:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[12:54:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[12:56:23] (PS1) Vgutierrez: profile::base::certificates: Ship Sectigo (E|R)46 root CAs on bullseye [puppet] - https://gerrit.wikimedia.org/r/1196909
[12:59:05] (PS1) Federico Ceratto: site.pp: set role for db-test* hosts [puppet] - https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056)
[12:59:36] (CR) CI reject: [V:-1] site.pp: set role for db-test* hosts [puppet] - https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056) (owner: Federico Ceratto)
[13:00:23] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196909 (owner: Vgutierrez)
[13:01:07] !log marostegui@cumin1003 dbctl commit (dc=all):
'db1195 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84067 and previous config saved to /var/cache/conftool/dbconfig/20251017-130106-root.json
[13:02:41] (CR) Cathal Mooney: [C:+1] "LGTM! Verified the contents of the files are the same as on my bookworm system which got them from debian repos." [puppet] - https://gerrit.wikimedia.org/r/1196909 (owner: Vgutierrez)
[13:09:57] !log updating ca-certificates package on bookworm puppetservers
[13:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:31] (CR) Neslihan Turan: "recheck" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196681 (owner: Neslihan Turan)
[13:12:17] (PS1) Dreamy Jazz: CheckUser UserInfoCard: Enable XTools menu link on SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1196914 (https://phabricator.wikimedia.org/T406012)
[13:15:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:59] (PS2) Vgutierrez: profile::base::certificates: Ship Sectigo (E|R)46 root CAs on bullseye [puppet] - https://gerrit.wikimedia.org/r/1196909
[13:17:18] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196909 (owner: Vgutierrez)
[13:18:01] (CR) Cathal Mooney: [C:+1] profile::base::certificates: Ship Sectigo (E|R)46 root CAs on bullseye [puppet] - https://gerrit.wikimedia.org/r/1196909 (owner: Vgutierrez)
[13:18:39] (PS27) Daniel Kinzler: api-gateway: Add rate limiting for REST gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574)
[13:20:41] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196023
(https://phabricator.wikimedia.org/T407199) (owner: Btullis)
[13:20:49] (CR) Tiziano Fogli: [C:+1] netops: add new BGP group names to CoreBGPDwon alert [alerts] - https://gerrit.wikimedia.org/r/1196900 (https://phabricator.wikimedia.org/T405558) (owner: Cathal Mooney)
[13:21:03] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: Btullis)
[13:24:33] (PS12) Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574)
[13:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:23] FIRING: [10x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:35:31] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11285222 (Andrew) btw the preseed for this host is: ` 'cloudcontrol100[8-9]|cloudcontrol1010|cloudcontrol2010-dev': - partman/standard.cfg -
partman/raid10-4dev.cfg `
[13:36:22] (PS1) Cathal Mooney: gnmic: Adjust BGP collection for Nokia compatibility [puppet] - https://gerrit.wikimedia.org/r/1196917 (https://phabricator.wikimedia.org/T405558)
[13:39:16] (PS2) Cathal Mooney: gnmic: Adjust BGP collection for Nokia compatibility [puppet] - https://gerrit.wikimedia.org/r/1196917 (https://phabricator.wikimedia.org/T405558)
[13:39:17] (CR) Kosta Harlan: [C:+1] CheckUser UserInfoCard: Enable XTools menu link on SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1196914 (https://phabricator.wikimedia.org/T406012) (owner: Dreamy Jazz)
[13:39:22] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11285239 (Ladsgroup) Pinged the user out of band for confirmation of the ssh key.
[13:39:39] (PS17) Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199)
[13:39:39] (PS16) Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199)
[13:40:10] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) (owner: Btullis)
[13:40:18] (PS2) Federico Ceratto: site.pp: set role for db-test* hosts [puppet] - https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056)
[13:43:26] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1196909 (owner: Vgutierrez)
[13:55:01] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: Btullis)
[14:02:47] (PS1) Slyngshede: Phabricator: Allow users to link Phabricator and developer accounts [software/bitu] -
10https://gerrit.wikimedia.org/r/1196919 (https://phabricator.wikimedia.org/T406495) [14:06:19] (03PS1) 10Brouberol: multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) [14:07:24] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7295/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: 10Brouberol) [14:11:40] (03PS1) 10Superpes15: [igwiki] Create 'autopatrolled' and 'rollbacker' usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196921 (https://phabricator.wikimedia.org/T407439) [14:15:40] (03PS4) 10Tiziano Fogli: haproxy_alive: enable nrpe wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) [14:15:40] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [14:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:21:16] (03PS2) 10Brouberol: multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) [14:22:43] (03CR) 10Vgutierrez: "current implementation requires that the loadbalancer is pooled before starting the reboot cookbook, this feels counter-intuitive, I'm won" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [14:23:02] (03PS1) 10Elukey: Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) [14:30:30] (03CR) 10Tiziano Fogli: [C:03+1] thanos-rule: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [14:30:52] (03CR) 10CI reject: [V:04-1] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [14:32:18] (03PS2) 10Elukey: Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) [14:33:25] (03PS1) 10Tiziano Fogli: haproxy_failover: enable nrpe wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196925 (https://phabricator.wikimedia.org/T407137) [14:38:24] (03PS1) 10Cathal Mooney: Add new Nokia switches to monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) [14:39:19] (03CR) 10Felds: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [14:39:38] (03CR) 10Felds: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [14:40:10] (03CR) 10Felds: [C:03+1] Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [14:40:50] (03CR) 10CI reject: [V:04-1] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [14:41:09] (03CR) 10Ssingh: haproxy_alive: enable nrpe wrapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [14:41:17] (03PS1) 10Ozge: feat: upgrades article descriptions buildkit 1.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196927 [14:41:47] (03PS3) 10Elukey: Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) [14:43:40] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1196927 (owner: 10Ozge) [14:44:12] (03CR) 10Ozge: [C:03+2] feat: upgrades article descriptions buildkit 1.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196927 (owner: 10Ozge) [14:44:20] (03PS2) 10Cathal Mooney: Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) [14:44:50] (03CR) 10CI reject: [V:04-1] Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [14:46:48] (03PS3) 10Cathal Mooney: Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) [14:47:21] 10ops-eqiad, 06SRE, 06DC-Ops: ML Server Models incorrectly entered into Netbox - https://phabricator.wikimedia.org/T407635 (10Jclark-ctr) 03NEW [14:47:40] (03PS5) 10Tiziano Fogli: haproxy: enable nrpe2nodexp wrapper on haproxy_alive check [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) [14:47:41] (03PS2) 10Tiziano Fogli: mariadb::proxy::master: enable nrpe2ndoexp wrapper on haproxy_failover [puppet] - 10https://gerrit.wikimedia.org/r/1196925 (https://phabricator.wikimedia.org/T407137) [14:48:01] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [14:48:12] (03CR) 10Ssingh: [C:03+1] haproxy: enable nrpe2nodexp wrapper on haproxy_alive check [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [14:48:39] 10ops-eqiad, 06SRE, 06DC-Ops: ML Server Models incorrectly entered into Netbox - https://phabricator.wikimedia.org/T407635#11285512 (10Jclark-ctr) [14:49:10] (03Abandoned) 10Cathal Mooney: LVS: Add new sub-interfaces to LVS in eqiad for rack e8 and f8 vlans [puppet] - 
10https://gerrit.wikimedia.org/r/1127134 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [14:49:16] (03CR) 10Tiziano Fogli: haproxy: enable nrpe2nodexp wrapper on haproxy_alive check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [14:49:26] (03CR) 10CI reject: [V:04-1] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [14:50:03] (03CR) 10Santiago Faci: "Looks good. Just a suggestion about adding a new stream that was added just today to the manual list we are going to remove" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: 10Phuedx) [14:54:02] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196925 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [15:05:40] (03PS1) 10FNegri: docker::network allow custom MTU value [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) [15:06:11] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [15:07:15] (03PS1) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) [15:08:08] (03CR) 10Felds: [C:03+1] Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [15:08:10] (03CR) 10CI reject: [V:04-1] Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) (owner: 10Superpes15) [15:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:37] (03CR) 10Cathal Mooney: "Hey thanks for the review! I will merge this next week <3" [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [15:09:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11285623 (10bking) Sorry, I've been head down on other things and did not see that this ticket was closed. Let me share a bit more... 
[15:09:30] (03PS2) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) [15:10:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11285625 (10bking) 05Declined→03Open [15:12:24] 10SRE-swift-storage: File missing from four datacenters - https://phabricator.wikimedia.org/T407589#11285636 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Thanks; our weekly rclone job would have caught up with this on Monday, but it's nice to have it resolved sooner :) [15:15:57] (03PS2) 10FNegri: docker::network allow custom MTU value [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) [15:16:09] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [15:17:37] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11285672 (10Jhancock.wm) @elukey whoops! I actually had to open it in a new browser to get it to work. but should be accessible now. If all else fails, try accessing the console... 
[15:20:17] (03PS3) 10FNegri: docker::network allow custom MTU value [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) [15:20:29] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [15:31:13] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11285716 (10CDanis) Are there any early estimates of the expected %age increase in something like logged-in daily active users? [15:33:14] !log Ran `mwscript-k8s --comment='First emails to users to get them to confirm their email address for T58074' extensions/WikimediaMaintenance/sendVerifyEmailReminderNotification.php --wiki=metawiki 20250917000000` [15:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:18] T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074 [15:34:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:22] (03PS1) 10Bking: site.pp: Add ganeti-jumbo hosts [puppet] - 10https://gerrit.wikimedia.org/r/1196935 (https://phabricator.wikimedia.org/T405964) [15:38:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T407555#11285753 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:43:54] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11285758 (10Jhancock.wm) [15:44:57] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11285759 (10Jhancock.wm) [15:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:45:35] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11285760 (10Jhancock.wm) making updates to the new power limits reached in the meeting. 4323 4300 1650 [15:50:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11285769 (10elukey) >>! In T406656#11285623, @bking wrote: > Sorry, I've been head down on other things and did not see that this ticke... 
[15:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:55:43] (03PS2) 10CDanis: haproxy: silent-drop: lower limit [puppet] - 10https://gerrit.wikimedia.org/r/1196723 [15:57:21] (03CR) 10Felds: [C:03+1] "No problem, this seemed pretty straightforward 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [16:01:36] (03PS8) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [16:01:40] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058'] [16:02:39] (03CR) 10Jcrespo: "Looks fine to me, but let's deploy next week while we are on top of things." [puppet] - 10https://gerrit.wikimedia.org/r/1196925 (https://phabricator.wikimedia.org/T407137) (owner: 10Tiziano Fogli) [16:04:19] (03CR) 10BBlack: [C:03+1] varnish: WMF-Uniq -> Analytics: fix frequency bug [puppet] - 10https://gerrit.wikimedia.org/r/1196154 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [16:05:25] jhathaway@cumin2002 upgrade-firmware (PID 2599553) is awaiting input [16:05:56] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11285785 (10BCornwall) 05Stalled→03In progress I was able to get in contact and the domain transfer will begin shortly. Unfortunately, services will be disrupted for a short while as we initiat... [16:08:33] (03PS1) 10Tiziano Fogli: dbbackups: enable nrpe2nodexp wrapper on mariadb_${type}_... 
checks [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) [16:09:21] jhathaway@cumin2002 upgrade-firmware (PID 2599553) is awaiting input [16:09:32] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2058'] [16:11:41] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) (owner: 10Tiziano Fogli) [16:13:16] (03CR) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [16:13:55] (03PS1) 10Btullis: Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) [16:17:36] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7296/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [16:25:48] (03PS1) 10Tiziano Fogli: monitoring: enable nrpe2nodexp wrapper on _owned [puppet] - 10https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120) [16:32:05] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120) (owner: 10Tiziano Fogli) [16:43:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:47:36] (03PS7) 10Scott French: P:cache::haproxy: start preparing for known-client DSL [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) [16:50:49] (03CR) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [16:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:55:16] (03PS8) 10Scott French: P:cache::haproxy: start preparing for known-client DSL [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) [17:08:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:09:24] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:11:14] (03PS1) 10Btullis: Configure reprepro to mirror upstream opensearch2 and opensearch3 repos [puppet] - 10https://gerrit.wikimedia.org/r/1196949 (https://phabricator.wikimedia.org/T407123) [17:13:13] (03CR) 10Dzahn: "just wondering why add the /wiki at all when we don't for other redirects above" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [17:13:31] (03CR) 10Btullis: "Do not merge until the following changes have also been merged:" [puppet] - 10https://gerrit.wikimedia.org/r/1196949 (https://phabricator.wikimedia.org/T407123) (owner: 10Btullis) [17:18:46] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11286029 (10violetwtf) @BCornwall I would like to take a moment to commend your persistence through all of this. 2.5 years later, here we go! [17:22:01] (03CR) 10Violetwtf: "easy to lose context after 2.5 years but enwp.org already exists and is being donated to WMF. enwp.org/URL_shortener -> en.wikipedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [17:23:41] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:25:01] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586#11286043 (10MoritzMuehlenhoff) Did you check that how the server has been provisioned? Maybe it got provisioned with UEFI, so if you try to install with a BIOS Partman recipe it would f... 
[17:26:06] 10ops-eqiad, 06SRE, 06DC-Ops: ML Server Models incorrectly entered into Netbox - https://phabricator.wikimedia.org/T407635#11286046 (10VRiley-WMF) 05Open→03Resolved Updated ps2 and these PDU's should be reflecting the correct models now [17:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:15] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11286089 (10Dzahn) Seconded! It's great to see old domain tickets being handled. Thank you, Brett. [17:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:28] (03PS1) 10Bking: ganeti-jumbo: Add hosts and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) [17:46:48] (03CR) 10CI reject: [V:04-1] ganeti-jumbo: Add hosts and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) (owner: 10Bking) [17:48:18] (03Abandoned) 10Bking: site.pp: Add ganeti-jumbo hosts [puppet] - 10https://gerrit.wikimedia.org/r/1196935 
(https://phabricator.wikimedia.org/T405964) (owner: 10Bking) [17:52:04] (03PS2) 10Bking: ganeti-jumbo: Add hosts and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) [17:53:18] (03PS1) 10Kamila Součková: proxoid: add discovery SAN [puppet] - 10https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) [17:54:49] (03CR) 10Kamila Součková: "Thanks to claime for beating me to finding the problem :D" [puppet] - 10https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: 10Kamila Součková) [17:58:05] (03CR) 10Bking: [C:03+1] Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [18:06:10] 06SRE, 10Data Pipelines, 06Data-Engineering, 06Data-Engineering-Icebox, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#11286186 (10LucasWerkmeister) 05Open→03Resolved I believe this task can now be closed (not sure which status is best, l... [18:15:37] (03CR) 10Btullis: [C:03+1] "Nice,thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: 10Brouberol) [18:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:19:32] (03CR) 10Dzahn: [C:03+1] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [18:25:56] (03PS1) 10Andrew Bogott: Add temporary raid10-4dev-trixie.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196957 [18:28:19] (03CR) 10CI reject: [V:04-1] Add temporary raid10-4dev-trixie.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196957 (owner: 10Andrew Bogott) [18:29:10] (03PS2) 10Andrew Bogott: Add temporary raid10-4dev-trixie.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196957 [18:31:33] (03CR) 10CI reject: [V:04-1] Add temporary raid10-4dev-trixie.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196957 (owner: 10Andrew Bogott) [18:38:28] (03CR) 10Bking: [C:03+1] multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: 10Brouberol) [18:40:56] (03CR) 10Ssingh: proxoid: add discovery SAN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: 10Kamila Součková) [18:41:28] (03PS3) 10Andrew Bogott: Add temporary raid10-4dev-trixie.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196957 [18:44:11] (03CR) 10Andrew Bogott: [C:03+2] Add temporary raid10-4dev-trixie.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1196957 (owner: 10Andrew Bogott) [18:45:04] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:47:05] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:47:51] (03CR) 10Ssingh: proxoid: add discovery SAN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: 10Kamila Součková) [18:48:04] (03PS9) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [18:48:34] (03CR) 10BCornwall: "Updated the commit message to also mention that in case a future blamer goes through." [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:01:43] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1196533 (https://phabricator.wikimedia.org/T406689) (owner: 10Andrea Denisse) [19:04:42] (03CR) 10Scott French: "So, it turns out option #1 is pretty straightforward: https://gitlab.wikimedia.org/repos/sre/hiddenparma/-/merge_requests/120" [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:11:39] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:11:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:16:45] (03CR) 10BCornwall: [C:03+2] ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:18:56] (03PS10) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:20:19] (03CR) 10Dzahn: [C:03+1] ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 
10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:25:38] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11286514 (10ssingh) Nice job indeed in pursuing this over the years, Brett! [19:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:50:18] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:51:24] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [19:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown