[00:07:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1193960 [00:07:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1193960 (owner: 10TrainBranchBot) [00:12:11] (03CR) 10Dzahn: osm_master: Store kartotherian and tegola passwords (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [00:27:04] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2005-dev.codfw.wmnet with OS trixie [00:28:20] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1193960 (owner: 10TrainBranchBot) [00:36:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:01:07] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:08:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.22 [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1193961 (https://phabricator.wikimedia.org/T405678) [01:08:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.22 [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1193961 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [01:11:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - 
cr1-eqiad:xe-3/1/6 (Transit: ... [01:11:51] NTT (234630) {#3475}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [01:12:50] !incidents [01:12:51] 6837 (UNACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [01:13:02] o/ [01:13:14] !incidents [01:13:14] 6837 (UNACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [01:13:25] !ack 6837 [01:13:26] 6837 (ACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [01:14:35] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 28s) [01:24:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.22 [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1193961 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [01:36:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:41:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: ... 
[01:41:51] NTT (234630) {#3475}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [01:51:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0200) [02:25:24] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:25:26] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:26:10] FIRING: BFDdown: BFD session down between cr2-magru and 195.200.68.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:27:24] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:27:26] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:27:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:33:02] RESOLVED: BFDdown: BFD session down between cr2-magru and 195.200.68.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:33:02] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:42:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0300) [03:02:00] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193967 (https://phabricator.wikimedia.org/T405678) [03:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193967 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [03:03:02] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193967 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [03:03:34] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.22 refs T405678 [03:03:37] T405678: 1.45.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T405678 [03:09:52] on-callers: There was someone scraping upload for a while on digital ocean instances - we added a requestctl rule to limit them: https://requestctl.wikimedia.org/action/cache-upload/limit_a075ef_ja3n_upload_scraper [03:10:20] oops, wrong channel ^^; [03:10:37] * brett jumps out the window [03:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:48:52] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.22 refs T405678 (duration: 45m 18s) [03:48:55] T405678: 1.45.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T405678 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0400) [04:00:28] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:02:34] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.19 (duration: 02m 32s) [04:04:12] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [04:38:18] (03CR) 10Giuseppe Lavagetto: [C:03+1] P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1192616 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [04:38:53] (03CR) 10Giuseppe Lavagetto: [C:03+1] P:conftool::hiddenparma: enable known_client_expression_validation [puppet] - 10https://gerrit.wikimedia.org/r/1192620 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [04:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:02:12] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1018.eqiad.wmnet with OS bullseye [05:03:01] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:17] (03PS1) 10Marostegui: db2237: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1193984 (https://phabricator.wikimedia.org/T406541) [05:35:59] (03CR) 10Marostegui: [C:03+2] db2237: Migrate to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1193984 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [05:36:25] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2237.codfw.wmnet with reason: Maintenance [05:36:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2237 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83617 and previous config saved to /var/cache/conftool/dbconfig/20251007-053628-root.json [05:36:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2237.codfw.wmnet with reason: Maintenance [05:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:40:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11248455 (10Marostegui) 05Open→03Resolved This is back to optimal ` [11336020.472994] scsi 0:2:0:0: Direct-Access DELL PERC H745 Frnt 5.16 PQ: 0 ANSI: 5 [11336020.473833] sd 0:2:... 
[05:44:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2237 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83618 and previous config saved to /var/cache/conftool/dbconfig/20251007-054457-root.json [05:49:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248460 (10Marostegui) 05Resolved→03Open @VRiley-WMF es1050 doesn't seem to be installed correctly, I am investigating [05:51:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS bookworm [05:52:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm [06:00:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2237 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83619 and previous config saved to /var/cache/conftool/dbconfig/20251007-060003-root.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0600) [06:00:04] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0600). [06:00:48] marostegui: OK to go with cxserver deployment? 
[06:00:57] kart_: yes [06:02:07] Thanks [06:02:13] (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update to 2025-10-06-084053-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193821 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [06:04:02] (03Merged) 10jenkins-bot: cxserver: staging: Update to 2025-10-06-084053-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193821 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [06:06:32] (03PS21) 10Ryan Kemper: Replace elasticsearch lib w/ spicerack APIClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [06:06:36] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:07:01] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:14:34] (03PS1) 10DCausse: NetworkSession: enable only for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193988 [06:15:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2237 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83620 and previous config saved to /var/cache/conftool/dbconfig/20251007-061509-root.json [06:15:41] (03PS1) 10KartikMistry: cxserver: Update to 2025-10-06-084053-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193989 (https://phabricator.wikimedia.org/T394982) [06:16:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193052 (https://phabricator.wikimedia.org/T389053) (owner: 10DCausse) [06:16:34] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:19:40] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org 
will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim [06:24:22] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#11248577 (10ABran-WMF) [06:24:26] !log rebalance Ganeti codfw/B following vmscape reboots [06:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:45] !log rebalance Ganeti eqiad/B following vmscape reboots [06:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:38] (03PS3) 10DCausse: cirrus: test completion with default sort on simplewiki [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) [06:26:38] (03PS3) 10DCausse: cirrus: test completion with default sort on simplewiki [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) [06:27:54] (03CR) 10Arnaudb: gerrit: mod_qos tweaks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [06:30:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2237 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83621 and previous config saved to /var/cache/conftool/dbconfig/20251007-063014-root.json [06:30:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [06:32:27] (03PS1) 10Marostegui: db2219: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194028 (https://phabricator.wikimedia.org/T406541) [06:32:45] (03CR) 10KartikMistry: [C:03+2] cxserver: Update to 2025-10-06-084053-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1193989 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [06:34:16] (03Merged) 10jenkins-bot: cxserver: Update to 2025-10-06-084053-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193989 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [06:35:49] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1050.eqiad.wmnet with OS bookworm [06:36:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm executed with errors: - es1050... [06:37:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS bookworm [06:37:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248603 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm [06:38:31] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for the Postfix Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193822 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [06:40:10] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:40:41] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:40:52] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [06:41:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [06:42:06] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:42:06] FIRING: CoreRouterInterfaceDown: Core 
router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:42:41] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:44:04] !log Updated cxserver to 2025-10-06-084053-production (T394982, T403574) [06:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:09] T394982: Migrate cxserver in production to node22 - https://phabricator.wikimedia.org/T394982 [06:44:09] T403574: Special:AutomaticTranslations - Title not shown for article - https://phabricator.wikimedia.org/T403574 [06:45:51] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [06:49:24] (03CR) 10Marostegui: [C:03+2] db2219: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194028 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [06:50:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2219.codfw.wmnet with reason: Maintenance [06:50:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2219 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83622 and previous config saved to /var/cache/conftool/dbconfig/20251007-065019-marostegui.json [06:50:33] !Incidents [06:51:07] !incidents [06:51:07] 6837 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [06:51:43] (03PS1) 10Muehlenhoff: Use wmflib::dir::mkdir_p to create /etc/wikimedia/maps [puppet] - 10https://gerrit.wikimedia.org/r/1194105 [06:52:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1194105 (owner: 10Muehlenhoff) [06:58:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2219 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83623 and previous config saved to /var/cache/conftool/dbconfig/20251007-065825-root.json [07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0700). [07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:16] o/ [07:00:27] I can deploy [07:01:49] (03CR) 10Elukey: sre.hardware.upgrade-firmware: fix ssd upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:02:13] (03CR) 10Majavah: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) (owner: 10BryanDavis) [07:02:29] (03CR) 10Jelto: [C:03+1] "this is no longer blocked by the gerrit upgrade and looks good to me, let me know how you want to proceed. 
I'm not sure if the dedicated s" [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [07:03:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193052 (https://phabricator.wikimedia.org/T389053) (owner: 10DCausse) [07:03:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:04:31] (03Merged) 10jenkins-bot: cirrus: stop copying ores weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193052 (https://phabricator.wikimedia.org/T389053) (owner: 10DCausse) [07:04:34] (03Merged) 10jenkins-bot: cirrus: test completion with default sort on simplewiki [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:05:33] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1193052|cirrus: stop copying ores weighted_tags (T389053)]], [[gerrit:1193092|cirrus: test completion with default sort on simplewiki [2/3] (T404858)]] [07:05:38] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053 [07:05:38] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:05:46] (03CR) 10Joal: "Two mistakes and one file missing (the template file for the new `centralauth_prodution script). 
You'll also need a new `sqoop_file` in re" [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [07:06:40] (03PS1) 10Jelto: phabricator: delay pages by 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1194108 (https://phabricator.wikimedia.org/T406338) [07:10:42] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1050.eqiad.wmnet with OS bookworm [07:10:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm executed with errors: - es1050... [07:11:53] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1193052|cirrus: stop copying ores weighted_tags (T389053)]], [[gerrit:1193092|cirrus: test completion with default sort on simplewiki [2/3] (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:11:57] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053 [07:11:58] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:12:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS bookworm [07:12:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm [07:13:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2219 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83624 and previous config saved to /var/cache/conftool/dbconfig/20251007-071331-root.json [07:14:37] !log dcausse@deploy2002 dcausse: Continuing with sync [07:17:01] (03PS4) 10Arnaudb: gerrit: fix typo in source path [cookbooks] - 10https://gerrit.wikimedia.org/r/1193860 (https://phabricator.wikimedia.org/T387833) [07:18:12] (03CR) 10Arnaudb: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1194108 (https://phabricator.wikimedia.org/T406338) (owner: 10Jelto) [07:21:05] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193052|cirrus: stop copying ores weighted_tags (T389053)]], [[gerrit:1193092|cirrus: test completion with default sort on simplewiki [2/3] (T404858)]] (duration: 15m 32s) [07:21:10] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053 [07:21:10] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:28:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2219 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83625 and previous config saved to /var/cache/conftool/dbconfig/20251007-072837-root.json [07:33:24] marostegui@cumin1003 reimage (PID 1635517) is awaiting input [07:33:38] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1050.eqiad.wmnet with OS bookworm [07:33:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm executed with errors: - es1050... [07:34:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248706 (10Marostegui) The first issue was that the host was running puppet 5. Now after running the whole process again, it got stuck on the installer, which is taking ages an... 
[07:34:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS bookworm [07:34:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm [07:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:43:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2219 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83626 and previous config saved to /var/cache/conftool/dbconfig/20251007-074342-root.json [07:55:14] marostegui@cumin1003 reimage (PID 1638460) is awaiting input [07:59:08] (03PS1) 10Marostegui: db2210: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194122 (https://phabricator.wikimedia.org/T406541) [07:59:52] (03CR) 10Marostegui: [C:03+2] db2210: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194122 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0800) [08:00:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2210.codfw.wmnet with reason: Maintenance [08:00:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2210 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83627 and previous config saved to /var/cache/conftool/dbconfig/20251007-080015-marostegui.json [08:00:17] morning, train will rollout in 5m [08:02:29] (03PS1) 10Slyngshede: site.pp deploy Tomcat/CAS to idp_test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194123 
(https://phabricator.wikimedia.org/T406455) [08:04:02] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7209/console" [puppet] - 10https://gerrit.wikimedia.org/r/1194123 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:04:53] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [08:05:27] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194125 (https://phabricator.wikimedia.org/T405678) [08:05:30] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194125 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [08:05:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194123 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:06:35] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194125 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [08:06:52] !log installing libsndfile security updates [08:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:59] (03CR) 10Slyngshede: [V:03+1 C:03+2] site.pp deploy Tomcat/CAS to idp_test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194123 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:08:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2210 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83628 and previous config saved to /var/cache/conftool/dbconfig/20251007-080803-root.json 
[08:08:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248764 (10Marostegui) The issue is that the host keeps booting into the installer on a loop even if I disabled the PXE boot via IPMI [08:16:58] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.22 refs T405678 [08:17:02] T405678: 1.45.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T405678 [08:20:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1050.eqiad.wmnet with OS bookworm [08:20:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm executed with errors: - es1050... [08:23:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2210 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83629 and previous config saved to /var/cache/conftool/dbconfig/20251007-082309-root.json [08:24:22] (03CR) 10Elukey: [C:03+1] "I am a little hesitant since mkdir_p uses ensure_resources, that will re-create a file/directory resource if it doesn't match with one alr" [puppet] - 10https://gerrit.wikimedia.org/r/1194105 (owner: 10Muehlenhoff) [08:24:59] 06SRE, 06Traffic, 06MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11248799 (10Vgutierrez) Do we know what the current behavior is for layers that set `X-Request-ID`? The usual approach for HAProxy is to... 
[08:27:40] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host es1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:32:33] (03CR) 10Stevemunene: [C:03+2] Define airflow-wikidata airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:33:16] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193860 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:34:14] (03Merged) 10jenkins-bot: Define airflow-wikidata airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:37:14] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:37:22] (03PS1) 10Federico Ceratto: preseed.yaml: Remove es2052 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194127 (https://phabricator.wikimedia.org/T402859) [08:37:24] (03PS1) 10Federico Ceratto: es2052.yaml: Prepare es2052 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1194128 (https://phabricator.wikimedia.org/T402859) [08:37:55] !log Stopped Gerrit on gerrit2003, deleted /srv/gerrit/git/* and restarted a full replication due to bad files ownership # T387833 [08:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:58] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833 [08:38:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2210 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83630 and previous config saved to /var/cache/conftool/dbconfig/20251007-083814-root.json [08:39:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - 
https://phabricator.wikimedia.org/T400198#11248845 (10elukey) re-run provisioning: ` Updated value for attribute BIOS.Setup.1-1 -> SetBootOrderEn: NIC.Embedded.1-1-1,HardDisk.List.1-1 => HardDisk.List.1-1,NIC.Embedded.... [08:41:41] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [08:42:08] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [08:42:30] !log tighten up acl for ssh access on pfw1-codfw T390939 [08:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:53] (03CR) 10Jelto: [C:03+2] phabricator: delay pages by 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1194108 (https://phabricator.wikimedia.org/T406338) (owner: 10Jelto) [08:44:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS bookworm [08:44:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11248877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm [08:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:45:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:45:35] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2056.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [08:45:49] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:45:51] (03PS1) 10Santiago Faci: xLab: Deploying v1.0.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194129 (https://phabricator.wikimedia.org/T404001) [08:47:04] (03PS1) 10Santiago Faci: xLab: Deploying v1.0.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194130 (https://phabricator.wikimedia.org/T404001) [08:50:17] elukey@cumin1003 provision (PID 1645237) is awaiting input [08:51:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11248907 (10Gehel) [08:51:24] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11248912 (10Gehel) [08:51:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11248920 (10Gehel) [08:52:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11248928 (10Gehel) [08:52:33] (03CR) 10Phuedx: [C:03+1] xLab: Deploying v1.0.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194129 (https://phabricator.wikimedia.org/T404001) (owner: 10Santiago Faci) [08:52:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:52:59] (03CR) 10Phuedx: [C:03+1] xLab: Deploying v1.0.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194130 
(https://phabricator.wikimedia.org/T404001) (owner: 10Santiago Faci) [08:53:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11248932 (10Gehel) [08:53:10] (03PS1) 10Slyngshede: R:idp-test enable idp-test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194131 [08:53:16] 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11248936 (10Gehel) [08:53:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2210 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83631 and previous config saved to /var/cache/conftool/dbconfig/20251007-085320-root.json [08:53:49] FIRING: HelmReleaseBadStatus: Helm release airflow-wikidata/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-wikidata - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:53:50] (03CR) 10Phuedx: [C:03+2] xLab: Deploying v1.0.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194129 (https://phabricator.wikimedia.org/T404001) (owner: 10Santiago Faci) [08:53:54] (03CR) 10Phuedx: [C:03+2] xLab: Deploying v1.0.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194130 (https://phabricator.wikimedia.org/T404001) (owner: 10Santiago Faci) [08:54:06] 06SRE: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545#11248940 (10Peachey88) [08:55:01] (03Merged) 10jenkins-bot: xLab: Deploying v1.0.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194129 
(https://phabricator.wikimedia.org/T404001) (owner: 10Santiago Faci) [08:55:03] (03Merged) 10jenkins-bot: xLab: Deploying v1.0.7 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194130 (https://phabricator.wikimedia.org/T404001) (owner: 10Santiago Faci) [08:55:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194131 (owner: 10Slyngshede) [08:55:45] (03PS3) 10Tiziano Fogli: metamonitoring: avoid unnecessary public endpoint restarts [puppet] - 10https://gerrit.wikimedia.org/r/1194121 (https://phabricator.wikimedia.org/T397003) [08:55:46] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since it's a minor change to avoid unnecessary restarts that could cause unwanted timeouts from HetrixTools." [puppet] - 10https://gerrit.wikimedia.org/r/1194121 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [08:56:13] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11248948 (10elukey) After the cache refresh in codfw: ` | | ssim | |-----:|---------:| | 0.05 | 0.955841 | | 0.1 | 0.975403 | | 0.2 | 0.989549 | | 0.25 | 0.992865 | | 0.5... 
[08:57:17] (03CR) 10Slyngshede: [C:03+2] R:idp-test enable idp-test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194131 (owner: 10Slyngshede) [08:57:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:58:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:59:32] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:00:50] elukey@cumin1003 provision (PID 1645237) is awaiting input [09:02:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1050.eqiad.wmnet with reason: host reimage [09:04:08] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [09:04:39] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:04:48] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [09:05:29] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [09:06:43] !log aqu@deploy2002 Started deploy [analytics/refinery@21fe78f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@21fe78fb] [09:07:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1050.eqiad.wmnet with reason: host reimage [09:07:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554 (10cmooney) 03NEW p:05Triage→03High [09:07:56] !log aqu@deploy2002 Finished deploy [analytics/refinery@21fe78f] 
(hadoop-test): Regular analytics weekly train TEST [analytics/refinery@21fe78fb] (duration: 01m 12s) [09:08:14] 06SRE, 10Hiddenparma: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545#11249021 (10Joe) [09:08:19] 07Puppet, 06Data-Engineering, 06Data-Engineering-Icebox, 10observability: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#11249022 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We have 0.15.0 running fleet-wide, resolving this t... [09:08:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1029 to clone es1049 T406488', diff saved to https://phabricator.wikimedia.org/P83633 and previous config saved to /var/cache/conftool/dbconfig/20251007-090826-marostegui.json [09:08:30] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [09:10:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool es1029 and depool es1026 to clone es1049 T406488', diff saved to https://phabricator.wikimedia.org/P83634 and previous config saved to /var/cache/conftool/dbconfig/20251007-091011-marostegui.json [09:11:43] (03PS13) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [09:11:52] (03PS14) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [09:11:58] (03PS1) 10Marostegui: mariadb: Productionize es1049 [puppet] - 10https://gerrit.wikimedia.org/r/1194136 (https://phabricator.wikimedia.org/T406488) [09:12:12] !log aqu@deploy2002 Started deploy [analytics/refinery@21fe78f]: Regular analytics weekly train [analytics/refinery@21fe78fb] [09:12:32] (03CR) 10Arnaudb: [C:03+2] gerrit: fix typo in source path [cookbooks] - 10https://gerrit.wikimedia.org/r/1193860 
(https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:14:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1026,1049].eqiad.wmnet with reason: Cloning [09:14:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11249042 (10cmooney) @BCornwall I'm hoping to make progress on this one, can you review the gerrit patch when you have a moment? In terms of how to... [09:15:37] (03PS2) 10Marostegui: mariadb: Productionize es1049 [puppet] - 10https://gerrit.wikimedia.org/r/1194136 (https://phabricator.wikimedia.org/T406488) [09:15:39] (03CR) 10Elukey: [C:03+1] wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [09:16:15] (03CR) 10Ladsgroup: [C:03+1] es2052.yaml: Prepare es2052 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1194128 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:16:25] (03CR) 10Ladsgroup: [C:03+1] preseed.yaml: Remove es2052 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194127 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:16:35] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1049 [puppet] - 10https://gerrit.wikimedia.org/r/1194136 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [09:17:17] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2052.codfw.wmnet'] [09:18:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:18:29] (03CR) 10Federico Ceratto: [C:03+2] es2052.yaml: Prepare es2052 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1194128 
(https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:18:35] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2052 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194127 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:18:55] (03Merged) 10jenkins-bot: gerrit: fix typo in source path [cookbooks] - 10https://gerrit.wikimedia.org/r/1193860 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:19:11] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [09:19:29] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1026,1049].eqiad.wmnet with reason: Cloning [09:19:58] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [09:20:30] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:20:54] (03PS1) 10Matthias Mullie: Remove P373 results form custommatch:linked_from [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194138 [09:22:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1026.eqiad.wmnet onto es1049.eqiad.wmnet [09:24:04] (03PS1) 10Muehlenhoff: acmechief: Add missing record for idp-test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194140 (https://phabricator.wikimedia.org/T406455) [09:24:56] (03CR) 10Slyngshede: [C:03+1] acmechief: Add missing record for idp-test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194140 (https://phabricator.wikimedia.org/T406455) (owner: 10Muehlenhoff) [09:25:41] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - marostegui@cumin1003" [09:26:00] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - marostegui@cumin1003" [09:26:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1050.eqiad.wmnet with OS bookworm [09:26:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11249092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host es1050.eqiad.wmnet with OS bookworm completed: - es1050 (**PASS**)... [09:27:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11249094 (10Marostegui) 05Open→03Resolved es1050 has been successfully reimaged and it is reachable now - thanks @elukey for all the help [09:27:49] jouncebot: nowandnext [09:27:49] For the next 0 hour(s) and 32 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T0800) [09:27:49] In 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1000) [09:27:53] (03CR) 10Slyngshede: [C:03+2] acmechief: Add missing record for idp-test1005 [puppet] - 10https://gerrit.wikimedia.org/r/1194140 (https://phabricator.wikimedia.org/T406455) (owner: 10Muehlenhoff) [09:28:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-wikidata/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-wikidata - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:30:13] (03PS15) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 
(https://phabricator.wikimedia.org/T403510) [09:32:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:33:34] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:33:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on es2052.codfw.wmnet with reason: Setting up new ES host [09:33:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:39:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2052.codfw.wmnet with reason: Setting up new ES host [09:41:02] (03PS1) 10Jelto: gitlab: add check for object storage credentials [puppet] - 10https://gerrit.wikimedia.org/r/1194144 (https://phabricator.wikimedia.org/T406234) [09:46:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:47:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on es2029.codfw.wmnet with reason: Setting up new ES host [09:47:37] (03PS1) 10Muehlenhoff: Record LDAP access for atitkov [puppet] - 10https://gerrit.wikimedia.org/r/1194145 [09:47:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2052.codfw.wmnet with reason: Setting up new ES host [09:48:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11249202 (10elukey) @Jhancock.wm tried again, then reset the idrac on 2056, re-run again 
but same error :( I've reset the IDRAC for cp2052 and I was able to up... [09:49:01] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7212/console" [puppet] - 10https://gerrit.wikimedia.org/r/1194144 (https://phabricator.wikimedia.org/T406234) (owner: 10Jelto) [09:49:11] (03PS2) 10Muehlenhoff: Record LDAP access for atitkov [puppet] - 10https://gerrit.wikimedia.org/r/1194145 [09:49:23] (03CR) 10Ladsgroup: [C:03+1] Undeploy FlaggedRevs from lawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193564 (https://phabricator.wikimedia.org/T406424) (owner: 10Ladsgroup) [09:49:26] (03CR) 10Ladsgroup: [C:03+2] Undeploy FlaggedRevs from lawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193564 (https://phabricator.wikimedia.org/T406424) (owner: 10Ladsgroup) [09:50:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193564 (https://phabricator.wikimedia.org/T406424) (owner: 10Ladsgroup) [09:51:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:04] (03Merged) 10jenkins-bot: Undeploy FlaggedRevs from lawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193564 (https://phabricator.wikimedia.org/T406424) (owner: 10Ladsgroup) [09:52:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2052.codfw.wmnet'] [09:52:37] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1193564|Undeploy FlaggedRevs from lawikisource (T406424)]] [09:52:40] T406424: Removed the FlaggedRevs extension on la.ws - 
https://phabricator.wikimedia.org/T406424 [09:53:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1028 to clone es1051 T406488', diff saved to https://phabricator.wikimedia.org/P83635 and previous config saved to /var/cache/conftool/dbconfig/20251007-095339-marostegui.json [09:53:43] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [09:53:55] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for atitkov [puppet] - 10https://gerrit.wikimedia.org/r/1194145 (owner: 10Muehlenhoff) [09:54:16] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: add check for object storage credentials [puppet] - 10https://gerrit.wikimedia.org/r/1194144 (https://phabricator.wikimedia.org/T406234) (owner: 10Jelto) [09:54:46] !log aqu@deploy2002 Finished deploy [analytics/refinery@21fe78f]: Regular analytics weekly train [analytics/refinery@21fe78fb] (duration: 42m 33s) [09:55:00] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es2029.codfw.wmnet [09:55:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2029.codfw.wmnet [09:55:37] (03PS1) 10Marostegui: mariadb: Productionize es1051 [puppet] - 10https://gerrit.wikimedia.org/r/1194148 (https://phabricator.wikimedia.org/T406488) [09:55:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1028,1051].eqiad.wmnet with reason: Cloning [09:56:35] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1051 [puppet] - 10https://gerrit.wikimedia.org/r/1194148 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [09:56:37] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2027.codfw.wmnet onto es2052.codfw.wmnet [09:56:41] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it to es2052.codfw.wmnet - fceratto@cumin1002 [09:56:59] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1193564|Undeploy 
FlaggedRevs from lawikisource (T406424)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:56:59] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2027 - Depool es2027.codfw.wmnet to then clone it to es2052.codfw.wmnet - fceratto@cumin1002 [09:57:49] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [09:58:01] !log aqu@deploy2002 Started deploy [analytics/refinery@21fe78f] (thin): Regular analytics weekly train THIN [analytics/refinery@21fe78fb] [09:59:06] !log aqu@deploy2002 Finished deploy [analytics/refinery@21fe78f] (thin): Regular analytics weekly train THIN [analytics/refinery@21fe78fb] (duration: 01m 05s) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1000) [10:00:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1028.eqiad.wmnet onto es1051.eqiad.wmnet [10:01:01] (03PS1) 10Reedy: Force OATHManage to be on central domain [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194150 (https://phabricator.wikimedia.org/T401773) [10:02:11] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193564|Undeploy FlaggedRevs from lawikisource (T406424)]] (duration: 09m 34s) [10:02:15] T406424: Removed the FlaggedRevs extension on la.ws - https://phabricator.wikimedia.org/T406424 [10:03:26] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e5-eqiad [10:03:32] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e5-eqiad [10:03:35] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e6-eqiad [10:03:41] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e6-eqiad [10:03:43] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f5-eqiad 
[10:03:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f5-eqiad [10:03:52] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e7-eqiad [10:03:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e7-eqiad [10:04:00] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f6-eqiad [10:04:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f6-eqiad [10:04:09] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f7-eqiad [10:04:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f7-eqiad [10:04:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:08:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134663 (https://phabricator.wikimedia.org/T389893) (owner: 10Ladsgroup) [10:09:12] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:09:12] (03Merged) 10jenkins-bot: mainstash: Disable multiPrimaryMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134663 (https://phabricator.wikimedia.org/T389893) (owner: 10Ladsgroup) [10:09:45] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1134663|mainstash: Disable multiPrimaryMode (T389893)]] [10:09:48] T389893: Remove modtoken and multiPrimaryMode from SqlBagOStuff and mainstash - https://phabricator.wikimedia.org/T389893 [10:12:11] (03PS1) 10Muehlenhoff: Create /etc/wikimedia in the cloud VPS base class [puppet] - 10https://gerrit.wikimedia.org/r/1194156 
[10:13:27] (03CR) 10Muehlenhoff: osm_master: Create /etc/wikimedia directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [10:14:17] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1134663|mainstash: Disable multiPrimaryMode (T389893)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:14:36] (03CR) 10Hnowlan: "lgtm, some nits and notes" [puppet] - 10https://gerrit.wikimedia.org/r/1193882 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [10:16:58] 10SRE-swift-storage, 06Commons: [[commons:File:Things near the Nautical Museum of Litochoro 10.jpg]] only present in codfw - https://phabricator.wikimedia.org/T406246#11249340 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon As expected, the Monday `rclone` copied this image across: ` curl -o /d... [10:18:29] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194158 [10:19:30] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:20:14] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:24:36] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1134663|mainstash: Disable multiPrimaryMode (T389893)]] (duration: 14m 51s) [10:24:39] T389893: Remove modtoken and multiPrimaryMode from SqlBagOStuff and mainstash - https://phabricator.wikimedia.org/T389893 [10:25:18] cmooney@cumin1003 netbox (PID 1657561) is awaiting input [10:25:55] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:31:26] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:31:46] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11249385 (10cmooney) @ssingh FYI I ran the //sre.dns.netbox// cookbook just now as it alerted on being a diff, it removed the entries for 
hcaptcha1001. The VM doesn't... [10:34:41] (03CR) 10Hnowlan: [C:03+1] "lgtm pending an ok from traffic!" [puppet] - 10https://gerrit.wikimedia.org/r/1193903 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [10:37:06] cmooney@cumin1003 netbox (PID 1659933) is awaiting input [10:37:40] (03PS1) 10Slyngshede: IDP-Test: Switch to new Debian 13 host [dns] - 10https://gerrit.wikimedia.org/r/1194160 (https://phabricator.wikimedia.org/T406455) [10:38:33] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new dns names - cmooney@cumin1003" [10:38:36] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:38:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new dns names - cmooney@cumin1003" [10:38:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:39:31] (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194158 (owner: 10Muehlenhoff) [10:42:53] (03CR) 10Slyngshede: [C:03+2] IDP-Test: Switch to new Debian 13 host [dns] - 10https://gerrit.wikimedia.org/r/1194160 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [10:43:06] !log slyngshede@dns1004 START - running authdns-update [10:44:12] !log slyngshede@dns1004 END - running authdns-update [10:46:19] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: use mw-api-ext rather than mw-api-int for all APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 
(https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [10:52:44] (03PS1) 10Marostegui: db2206: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194164 (https://phabricator.wikimedia.org/T406541) [10:53:16] (03CR) 10Marostegui: [C:03+2] db2206: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194164 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [10:53:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2206.codfw.wmnet with reason: Maintenance [10:53:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2206 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83637 and previous config saved to /var/cache/conftool/dbconfig/20251007-105337-marostegui.json [10:56:00] jouncebot: nowandnext [10:56:01] For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1000) [10:56:01] In 1 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1200) [10:58:41] (03PS1) 10FNegri: aptrepo: Add tofu package to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) [10:58:44] (03CR) 10Hnowlan: [C:03+2] rest-gateway: use mw-api-ext rather than mw-api-int for all APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:01:18] (03Merged) 10jenkins-bot: rest-gateway: use mw-api-ext rather than mw-api-int for all APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:01:19] (03CR) 10FNegri: aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [11:01:59] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'db2206 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83638 and previous config saved to /var/cache/conftool/dbconfig/20251007-110158-root.json [11:04:01] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:04:10] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:05:39] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1194171 (https://phabricator.wikimedia.org/T406543) [11:06:08] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1194171 (https://phabricator.wikimedia.org/T406543) (owner: 10Marostegui) [11:06:37] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1194171 (https://phabricator.wikimedia.org/T406543) (owner: 10Marostegui) [11:07:35] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:07:40] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:08:09] !log rebalance Ganeti codfw/C following vmscape reboots [11:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:24] (03PS1) 10Hnowlan: rest-gateway: correct port in networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194172 (https://phabricator.wikimedia.org/T401396) [11:08:33] (03CR) 10CI reject: [V:04-1] rest-gateway: correct port in networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194172 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:11:25] (03CR) 10Marostegui: clone_es.py: clone readonly es* hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [11:11:59] !log imported prometheus-jmx-exporter 0.15.0 for trixie-wikimedia T406455 
[11:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:02] (03PS2) 10Hnowlan: rest-gateway: correct port in networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194172 (https://phabricator.wikimedia.org/T401396) [11:12:03] T406455: Upgrade Apereo CAS to version 7.2 - https://phabricator.wikimedia.org/T406455 [11:13:00] !log imported cas 7.1.6.2 for trixie-wikimedia T406455 [11:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1169 T406543', diff saved to https://phabricator.wikimedia.org/P83639 and previous config saved to /var/cache/conftool/dbconfig/20251007-111438-marostegui.json [11:14:42] T406543: Compile and package MariaDB 10.11.14 - https://phabricator.wikimedia.org/T406543 [11:14:48] (03CR) 10Hnowlan: [C:03+2] rest-gateway: correct port in networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194172 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:15:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Upgrading [11:16:16] !log Upgrade db1169 (s1) to 10.11.14 T406543 [11:16:24] (03PS9) 10Clément Goubert: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [11:16:27] (03Merged) 10jenkins-bot: rest-gateway: correct port in networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194172 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:32] (03PS23) 10Clément Goubert: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [11:17:05] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'db2206 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83640 and previous config saved to /var/cache/conftool/dbconfig/20251007-111704-root.json [11:17:10] (03CR) 10Clément Goubert: "Only change in PS22 is a rebase and bumping the `Chart.yaml` version." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [11:18:03] (03CR) 10CI reject: [V:04-1] api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [11:18:49] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:18:55] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:19:02] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:23:11] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org [11:23:16] (03PS1) 10Milimetric: Configure a web_base_with_ip stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 (https://phabricator.wikimedia.org/T406359) [11:23:31] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:23:44] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:25:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After upgrade to 10.11.14', diff saved to https://phabricator.wikimedia.org/P83642 and previous config saved to /var/cache/conftool/dbconfig/20251007-112501-root.json [11:25:38] (03PS11) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [11:26:18] (03PS6) 10Pmiazga: 
api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [11:26:59] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org [11:27:12] (03CR) 10CI reject: [V:04-1] api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [11:27:48] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2005.wikimedia.org [11:27:56] (03CR) 10CI reject: [V:04-1] api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [11:29:06] (03PS1) 10Hnowlan: rest-gateway: tweak restbase-compat stats label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194176 (https://phabricator.wikimedia.org/T401396) [11:30:13] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2005.wikimedia.org [11:32:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2206 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83643 and previous config saved to /var/cache/conftool/dbconfig/20251007-113210-root.json [11:33:41] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS trixie [11:38:46] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: tweak restbase-compat stats label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194176 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:39:42] (03CR) 10Hnowlan: [C:03+2] rest-gateway: tweak restbase-compat stats label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194176 (https://phabricator.wikimedia.org/T401396) (owner: 
10Hnowlan) [11:40:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After upgrade to 10.11.14', diff saved to https://phabricator.wikimedia.org/P83644 and previous config saved to /var/cache/conftool/dbconfig/20251007-114007-root.json [11:41:25] (03Merged) 10jenkins-bot: rest-gateway: tweak restbase-compat stats label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194176 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [11:42:53] (03PS1) 10Esanders: Invalidate Flow cache on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194180 (https://phabricator.wikimedia.org/T405080) [11:44:11] FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:44:22] (03PS10) 10Clément Goubert: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [11:44:22] (03PS24) 10Clément Goubert: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [11:44:52] (03CR) 10Clément Goubert: "Which obviously introduced conflicts x)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: 10Daniel Kinzler) [11:46:12] (03CR) 10Clément Goubert: [C:03+1] Allow deployment group to sudo -u mwbuilder scap clean-images [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) (owner: 10Ahmon Dancy) [11:47:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2206 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83645 and previous config saved to 
/var/cache/conftool/dbconfig/20251007-114716-root.json [11:48:42] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:48:49] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:49:16] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:49:23] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:49:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194180 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [11:49:43] (03CR) 10Ladsgroup: "Haven't tested it but beside these two comments, it looks good to me." [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [11:49:53] RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:50:32] (03CR) 10Marostegui: [C:04-1] migrate.py: MariaDB version migration cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [11:50:35] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:50:48] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:55:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After upgrade to 10.11.14', diff saved to https://phabricator.wikimedia.org/P83646 and previous config saved to /var/cache/conftool/dbconfig/20251007-115513-root.json [11:55:38] (03CR) 10Phuedx: [C:03+1] "LGTM 
but see my comment inline about an inline comment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 (https://phabricator.wikimedia.org/T406359) (owner: 10Milimetric) [11:56:50] !incidents [11:56:50] 6837 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:59:00] (03CR) 10Michael Große: [C:03+1] Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1200) [12:01:31] (03CR) 10Filippo Giunchedi: aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [12:01:45] (03CR) 10Filippo Giunchedi: [C:03+1] Create /etc/wikimedia in the cloud VPS base class [puppet] - 10https://gerrit.wikimedia.org/r/1194156 (owner: 10Muehlenhoff) [12:03:59] (03CR) 10Michael Große: [C:04-1] "This is perfectly fine, but it must only be deployed after 1.45.0-wmf.23 reached production (estimated Thursday 16th of October)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [12:10:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After upgrade to 10.11.14', diff saved to https://phabricator.wikimedia.org/P83647 and previous config saved to /var/cache/conftool/dbconfig/20251007-121020-root.json [12:15:26] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test2005.wikimedia.org with OS trixie [12:19:23] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406524#11249785 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:23:46] !log rebalance Ganeti eqiad/C following vmscape reboots [12:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After upgrade to 10.11.14', diff saved to https://phabricator.wikimedia.org/P83649 and previous config saved to /var/cache/conftool/dbconfig/20251007-122526-root.json [12:36:53] (03CR) 10Federico Ceratto: clone_es.py: clone readonly es* hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [12:38:08] (03CR) 10FNegri: aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [12:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:46:11] (03CR) 10Filippo Giunchedi: aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 
10FNegri) [12:49:08] (03CR) 10Elukey: [C:03+1] Create /etc/wikimedia in the cloud VPS base class [puppet] - 10https://gerrit.wikimedia.org/r/1194156 (owner: 10Muehlenhoff) [12:58:43] (03PS1) 10Elukey: admin: add new ssh key for elukey [puppet] - 10https://gerrit.wikimedia.org/r/1194191 [13:00:06] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1300). nyaa~ [13:00:06] edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:52] o/ [13:01:11] o/ [13:01:25] edsanders: want to self-service? [13:01:37] yeah [13:01:52] (03PS3) 10Ottomata: sqoop - fix centralauth - use seperate script and add to sqoop-whole-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) [13:02:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194180 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [13:02:21] (03CR) 10Ottomata: "Sheesh, sorry for the stupid mistakes. 
I was desperately trying to get this done while also in meetings yesterday :/" [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [13:03:07] (03PS4) 10Ottomata: sqoop - fix centralauth - use seperate script and add to sqoop-whole-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) [13:03:15] (03Merged) 10jenkins-bot: Invalidate Flow cache on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194180 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [13:03:49] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1194180|Invalidate Flow cache on enwiktionary (T405080)]] [13:03:52] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [13:05:47] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [13:06:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11249917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm executed with errors: -... 
[13:06:13] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:06:16] (03CR) 10Muehlenhoff: [C:03+1] "The key has been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1189435 (owner: 10Gmodena) [13:06:35] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:06:48] (03CR) 10Muehlenhoff: [C:03+2] admin: add sk-ssh-ed25519 key for gmodena [puppet] - 10https://gerrit.wikimedia.org/r/1189435 (owner: 10Gmodena) [13:08:16] !log esanders@deploy2002 esanders: Backport for [[gerrit:1194180|Invalidate Flow cache on enwiktionary (T405080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:20] (03CR) 10Ssingh: "@bking@wikimedia.org / @rkemper@wikimedia.org: this needs your review before being merged. Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) (owner: 10Ssingh) [13:09:33] !log esanders@deploy2002 esanders: Continuing with sync [13:09:54] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11249922 (10ssingh) >>! In T406166#11249385, @cmooney wrote: > @ssingh FYI I ran the //sre.dns.netbox// cookbook just now as it alerted on being a diff, it removed the... 
[13:10:16] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha1001.wikimedia.org [13:10:18] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [13:13:05] jhancock@cumin1002 provision (PID 188072) is awaiting input [13:13:56] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194180|Invalidate Flow cache on enwiktionary (T405080)]] (duration: 10m 07s) [13:13:59] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [13:14:06] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1001.wikimedia.org - sukhe@cumin1003" [13:14:10] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1001.wikimedia.org - sukhe@cumin1003" [13:14:10] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:14:11] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha1001.wikimedia.org on all recursors [13:14:14] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha1001.wikimedia.org on all recursors [13:16:50] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha1001.wikimedia.org - sukhe@cumin1003" [13:16:54] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha1001.wikimedia.org - sukhe@cumin1003" [13:17:07] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:17:21] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1001.wikimedia.org with OS trixie [13:17:29] 
!log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:17:53] (03PS1) 10Elukey: profile::puppetserver::backup: add a backup for /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/1194192 (https://phabricator.wikimedia.org/T405580) [13:18:21] (03CR) 10CI reject: [V:04-1] profile::puppetserver::backup: add a backup for /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/1194192 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [13:18:38] (03CR) 10Elukey: "Tried to follow https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client, let me know if I missed anything important!" [puppet] - 10https://gerrit.wikimedia.org/r/1194192 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [13:18:48] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:19:11] (03PS2) 10Elukey: profile::puppetserver::backup: add a backup for /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/1194192 (https://phabricator.wikimedia.org/T405580) [13:25:25] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11249998 (10ssingh) 05Open→03Resolved a:03ssingh `hcaptcha200[1-2].wikimedia.org` are ready. [13:27:47] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [13:28:17] (03CR) 10Tiziano Fogli: "Yeah, it seems to be OK now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [13:28:57] !log rebalance Ganeti codfw/D following vmscape reboots [13:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:08] jouncebot: nowandnext [13:29:08] For the next 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1300) [13:29:09] In 0 hour(s) and 30 minute(s): Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1400) [13:29:30] if edsanders is done deploying, maybe I can try backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/1194190 right away 🤔 [13:29:39] (any objection Amir1?) [13:29:39] I'm done [13:30:06] none on my side but the gate submit is red it seems [13:30:08] bah, gate-and-submit is failing anyway [13:30:16] Access level to MediaWiki\Extension\CentralAuth\Special\SpecialGlobalGroupMembership::showLogFragment() must be protected (as in class MediaWiki\SpecialPage\UserGroupsSpecialPage) or weaker in /workspace/src/extensions/CentralAuth/includes/Special/SpecialGlobalGroupMembership.php [13:31:09] jmm@cumin2002 reimage (PID 465424) is awaiting input [13:31:53] caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1194117 [13:33:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11250019 (10Jhancock.wm) @elukey fixed cp2050. 
opening a ticket for cp2056 [13:34:17] (03PS1) 10Lucas Werkmeister (WMDE): Fix calls to incrementStatsKey() [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194193 (https://phabricator.wikimedia.org/T406569) [13:34:20] I’ll backport it anyway [13:34:25] gate-and-submit should be unbroken soon enough [13:34:29] and it should pass on the wmf branches [13:34:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:34:40] (03PS1) 10Lucas Werkmeister (WMDE): Fix calls to incrementStatsKey() [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194194 (https://phabricator.wikimedia.org/T406569) [13:35:00] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [13:35:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [13:35:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194193 (https://phabricator.wikimedia.org/T406569) (owner: 10Lucas Werkmeister (WMDE)) [13:35:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194194 (https://phabricator.wikimedia.org/T406569) (owner: 10Lucas Werkmeister (WMDE)) [13:35:29] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [13:36:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:36:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [13:38:13] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [13:39:52] 06SRE, 06Traffic, 06MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11250057 (10CDanis) >>! In T221976#11248799, @Vgutierrez wrote: > The usual approach for HAProxy is to generate a UUID, append it to the... [13:41:20] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [13:41:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11250063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [13:43:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS bullseye [13:45:00] (03CR) 10Hnowlan: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194158 (owner: 10Muehlenhoff) [13:45:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [13:47:34] (03CR) 10Tiziano Fogli: "I’m not sure if this is the intended behavior, but thanos-rule@pilot will not be listed in /etc/thanos-query/stores/rule.yml (see modules/" [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [13:47:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11250100 (10cmooney) I'm actually not sure if this is going to be a possibility. Unfortunately the Nokia SR-Linux platfo... 
[13:48:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11250105 (10cmooney) See T405630#11250099, I'm not sure this will be possible. [13:48:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS bullseye [13:49:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1020.eqiad.wmnet with OS bullseye [13:50:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [13:51:06] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1001.wikimedia.org with OS trixie [13:51:06] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha1001.wikimedia.org [13:51:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:32] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2006.codfw.wmnet with reason: host reimage [13:52:49] (03CR) 10Majavah: [C:04-1] aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [13:53:22] (03Merged) 10jenkins-bot: Fix calls to incrementStatsKey() [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194193 (https://phabricator.wikimedia.org/T406569) (owner: 10Lucas Werkmeister (WMDE)) [13:53:24] (03Merged) 10jenkins-bot: Fix calls to incrementStatsKey() [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194194 
(https://phabricator.wikimedia.org/T406569) (owner: 10Lucas Werkmeister (WMDE)) [13:53:44] (03CR) 10Majavah: [C:03+1] "this seems ok, or having it in `profile::base` directly would work also" [puppet] - 10https://gerrit.wikimedia.org/r/1194156 (owner: 10Muehlenhoff) [13:54:00] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194193|Fix calls to incrementStatsKey() (T406569)]], [[gerrit:1194194|Fix calls to incrementStatsKey() (T406569)]] [13:54:03] T406569: ArgumentCountError: Too few arguments to function Wikibase\Client\DataAccess\Scribunto\WikibaseLibrary::incrementStatsKey(), 1 passed in /srv/mediawiki/php-1.45.0-wmf.21/extensions/Scribunto/includes/Engines/LuaSandbox/LuaSandb - https://phabricator.wikimedia.org/T406569 [13:54:04] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [13:55:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2017.codfw.wmnet'] [13:55:07] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:56:34] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2050.codfw.wmnet'] [13:56:40] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha1002.wikimedia.org [13:56:41] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [13:58:17] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1194193|Fix calls to incrementStatsKey() (T406569)]], [[gerrit:1194194|Fix calls to incrementStatsKey() (T406569)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:58:37] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2006.codfw.wmnet with reason: host reimage [13:59:20] https://commons.wikimedia.org/wiki/User:Premeditated/sandbox is fixed (no longer crashes, though there are random lua errors) on mwdebug \o/ [13:59:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1400) [14:00:09] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [14:00:41] I’m still deploying, sorry [14:00:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [14:00:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:42] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha1002.wikimedia.org on all recursors [14:00:44] should be done soon though [14:00:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha1002.wikimedia.org on all recursors [14:01:49] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1020.eqiad.wmnet'] [14:03:00] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590 (10Neslihan_Turan_WMDE) 03NEW [14:03:59] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194193|Fix calls to incrementStatsKey() (T406569)]], [[gerrit:1194194|Fix calls to incrementStatsKey() (T406569)]] (duration: 09m 58s) [14:04:02] T406569: ArgumentCountError: Too few arguments to function 
Wikibase\Client\DataAccess\Scribunto\WikibaseLibrary::incrementStatsKey(), 1 passed in /srv/mediawiki/php-1.45.0-wmf.21/extensions/Scribunto/includes/Engines/LuaSandbox/LuaSandb - https://phabricator.wikimedia.org/T406569 [14:04:19] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592 (10seanleong-WMDE) 03NEW [14:04:50] !log UTC afternoon backport+config window done [14:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:46] (03CR) 10Eevans: [C:03+1] wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [14:09:53] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:11:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1020.eqiad.wmnet'] [14:11:04] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [14:11:20] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1020.eqiad.wmnet'] [14:15:07] (03CR) 10Muehlenhoff: aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [14:15:11] elukey@cumin2002 upgrade-firmware (PID 478283) is awaiting input [14:16:20] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [14:16:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts 
['wdqs1018.eqiad.wmnet'] [14:17:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['wdqs1020.eqiad.wmnet'] [14:19:25] jhancock@cumin1002 reimage (PID 244748) is awaiting input [14:21:15] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [14:21:16] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [14:21:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11250311 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm completed: - wikikube-ct... [14:21:36] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:21:40] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [14:21:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11250314 (10Jhancock.wm) [14:21:44] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [14:21:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1002.wikimedia.org with OS trixie [14:22:03] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [14:22:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: 
Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11250317 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @Clement_Goubert this one is ready to go. thanks for your patience! [14:22:36] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [14:22:59] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:23:28] (03CR) 10Ssingh: "@vgutierrez@wikimedia.org: this could use your review, so whenever you have a chance please." [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: 10Krinkle) [14:23:33] elukey@cumin2002 upgrade-firmware (PID 478283) is awaiting input [14:23:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11250327 (10Clement_Goubert) >>! In T400661#11250317, @Jhancock.wm wrote: > @Clement_Goubert this one is ready to go. thanks for your patience! 
Thanks a bunch <3 [14:24:49] (03CR) 10Jcrespo: [C:03+1] "All good from backups side, backups are encrypted on the wire and at rest, and we would audit what fileset would be included in a future p" [puppet] - 10https://gerrit.wikimedia.org/r/1194192 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [14:25:27] (03PS2) 10FNegri: aptrepo: Add tofu package to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) [14:25:47] 06SRE, 06serviceops: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596 (10Clement_Goubert) 03NEW [14:26:08] 06SRE, 06serviceops: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11250355 (10Clement_Goubert) p:05Triage→03Medium [14:27:53] (03CR) 10FNegri: aptrepo: Add tofu package to trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [14:29:05] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1020.eqiad.wmnet'] [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1430) [14:31:29] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [14:32:43] (03PS1) 10Elukey: Remove a deprecation warning for datetime in _menu.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) [14:34:17] (03PS1) 10Cathal Mooney: ssw1-e1-eqiad: Add EBGP peering to ssw1-d1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1194215 (https://phabricator.wikimedia.org/T402588) [14:34:30] (03CR) 10Volans: "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) (owner: 10Elukey) [14:34:35] (03CR) 10Volans: [C:03+1] Remove a deprecation warning for datetime in 
_menu.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) (owner: 10Elukey) [14:41:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1020.eqiad.wmnet'] [14:42:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [14:42:22] (03CR) 10Ahmon Dancy: [C:03+1] Create /etc/wikimedia in the cloud VPS base class [puppet] - 10https://gerrit.wikimedia.org/r/1194156 (owner: 10Muehlenhoff) [14:42:59] (03Abandoned) 10Ahmon Dancy: osm_master: Create /etc/wikimedia directory [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [14:45:17] jouncebot: nowandnext [14:45:17] For the next 0 hour(s) and 14 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1430) [14:45:17] In 0 hour(s) and 14 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1500) [14:47:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11250491 (10elukey) I was able to provision and upgrade idrac+bios on 2050, thanks! 
[14:48:16] (03CR) 10Cathal Mooney: [C:03+2] ssw1-e1-eqiad: Add EBGP peering to ssw1-d1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1194215 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) [14:48:32] (03CR) 10Jasmine: [C:03+2] wmnet: remove wikikube-ctrl1001 from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1193266 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [14:49:58] (03Merged) 10jenkins-bot: ssw1-e1-eqiad: Add EBGP peering to ssw1-d1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1194215 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) [14:50:34] !log jasmine@dns1004 START - running authdns-update [14:51:02] jasmine_: I gather wikikube-ctrl1001 being decommissioned? [14:51:16] topranks: yeah [14:51:18] yes [14:51:34] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [14:51:46] !log jasmine@dns1004 END - running authdns-update [14:51:48] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug-next_4453: Servers wikikube-worker1268.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1135.eqiad.wmnet, wikikube-worker1143.eqiad.wm [14:51:48] ikube-worker1267.eqiad.wmnet, wikikube-worker1111.eqiad.wmnet, wikikube-worker1299.eqiad.wmnet, wikikube-worker1293.eqiad.wmnet, wikikube-worker1158.eqiad.wmnet, wikikube-worker1303.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1109.eqiad.wmnet, wikikube-worker1294.eqiad.wmnet, wikikube-worker1266.eqiad.wmnet, wikikube-worker1325.eqiad.wmnet, wikikube-worker1314.eqiad.wmnet, wikikube-worker1285.eqiad.wmnet, wikikube-worker1 [14:51:48] d.wmnet, wikikube-worker1252.eqiad.wmnet, wikikube-worker1261.eqiad.wmnet, 
wikikube-worker1064.eqiad.wmnet, wikikube-worker1264.eqiad.wmnet, wikikube-worker1061.eqiad.wmnet, wikikube-wo https://wikitech.wikimedia.org/wiki/PyBal [14:51:48] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug-next_4453: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1150.eqiad.wmnet, wikikube-worker1293.eqiad.wmnet, wikikube-worker1158.eqiad.wm [14:51:48] ikube-worker1065.eqiad.wmnet, wikikube-worker1120.eqiad.wmnet, wikikube-worker1117.eqiad.wmnet, wikikube-worker1271.eqiad.wmnet, wikikube-worker1048.eqiad.wmnet, wikikube-worker1165.eqiad.wmnet, wikikube-worker1109.eqiad.wmnet, wikikube-worker1294.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, wikikube-worker1325.eqiad.wmnet, wikikube-worker1262.eqiad.wmnet, wikikube-worker1285.eqiad.wmnet, wikikube-worker1246.eqiad.wmnet, wikikube-worker1 [14:51:48] d.wmnet, wikikube-worker1075.eqiad.wmnet, wikikube-worker1041.eqiad.wmnet, wikikube-worker1107.eqiad.wmnet, wikikube-worker1086.eqiad.wmnet, wikikube-worker1030.eqiad.wmnet, wikikube-wo https://wikitech.wikimedia.org/wiki/PyBal [14:51:57] FIRING: [2x] ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:14] Huh that's not [14:52:28] dns? [14:52:28] Depool the host jasmine_ [14:52:32] I think we forgot that [14:52:37] x) [14:52:46] <_joe_> claime: are you on it? 
[14:52:48] ack, doing [14:53:02] it should be fine btw [14:53:10] !log jasmine@deploy2002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl1001.eqiad.wmnet [14:53:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:53:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11250521 (10VRiley-WMF) Yes, it seems like there is an issue with the fan, it is showing the warning lights for the fan. Is it okay to proceed w... [14:53:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193941 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [14:53:33] <_joe_> claime: looks like the mw-api-ext thing is unrelated? [14:53:39] _joe_: maybe [14:53:47] it's a weird coincidence [14:53:56] 5xx spike as well as latency [14:53:59] but yeah, db overload [14:54:17] <_joe_> yes [14:54:17] claime, jasmine_: I picked a bad time. but anyway when wikikube-ctrl1001 is removed from 'control-plane-nodes' for the cluster in hiera I can re-run the reverse DNS delegation script I have to fix up the NS entries that currently point to it [14:54:20] (03CR) 10CI reject: [V:04-1] Remove a deprecation warning for datetime in _menu.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) (owner: 10Elukey) [14:54:38] topranks: ack [14:54:41] (anything I can help with?) 
[14:54:53] FIRING: [2x] ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:01] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11250537 (10elukey) [14:55:08] <_joe_> federico3: yeah take a look at the databases, which sections are overloaded? [14:55:08] cluster30/cluster31 circuit breaking [14:55:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS bookworm [14:55:53] (03PS1) 10Hashar: Disable motd banner: maintenance window has closed [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194220 (https://phabricator.wikimedia.org/T387833) [14:56:48] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:48] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:57] RESOLVED: [2x] ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:02] I'm seeing s3 getting a lot of writes [14:57:18] but they all look like they're recovering [14:57:26] <_joe_> yes [14:57:29] (03CR) 10Elukey: [C:04-1] "it fails for older version of python, I'll add a workaround." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1194213 (https://phabricator.wikimedia.org/T401581) (owner: 10Elukey) [14:57:36] also throughput on s8 [14:57:43] jmm@cumin2002 reimage (PID 474543) is awaiting input [14:57:48] (03CR) 10Arnaudb: "lgtm!" 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194220 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [14:58:04] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye [14:58:08] <_joe_> s8 can be by fetching wikidata items [14:58:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:58:43] (03CR) 10Arnaudb: [C:03+1] Disable motd banner: maintenance window has closed [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194220 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [14:59:11] RESOLVED: [2x] ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:33] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T406597 [14:59:37] T406597: Deploy Phabricator/Phorge 2025-10-07 - https://phabricator.wikimedia.org/T406597 [15:00:00] (03CR) 10Hashar: [V:03+2 C:03+2] Disable motd banner: maintenance window has closed [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194220 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [15:00:05] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1500). 
nyaa~ [15:01:20] (03Merged) 10jenkins-bot: Disable motd banner: maintenance window has closed [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194220 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [15:01:23] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [15:01:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11250579 (10Papaul) [15:01:49] !log brennen@deploy2002 Started deploy [phabricator/deployment@f2d2c87]: deploy phab2002 for T406597 [15:02:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11250581 (10Papaul) [15:02:19] !log brennen@deploy2002 Finished deploy [phabricator/deployment@f2d2c87]: deploy phab2002 for T406597 (duration: 00m 31s) [15:02:41] !log brennen@deploy2002 Started deploy [phabricator/deployment@f2d2c87]: deploy phab1004 for T406597 [15:03:07] !log hashar@deploy2002 Started deploy [gerrit/gerrit@21d2848]: Disable motd banner: maintenance window has closed - T387833 [15:03:19] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833 [15:03:33] !log brennen@deploy2002 Finished deploy [phabricator/deployment@f2d2c87]: deploy phab1004 for T406597 (duration: 00m 52s) [15:03:37] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@21d2848]: Disable motd banner: maintenance window has closed - T387833 (duration: 00m 30s) [15:03:49] (03PS2) 10Milimetric: Configure a web_base_with_ip stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 (https://phabricator.wikimedia.org/T406359) [15:04:21] (03CR) 10Milimetric: Configure a web_base_with_ip stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 
(https://phabricator.wikimedia.org/T406359) (owner: 10Milimetric) [15:05:03] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Add missing REST Gateway for Beta Cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: 10Krinkle) [15:09:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:45] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2044.codfw.wmnet with OS bookworm [15:10:48] !incidents [15:10:48] 6839 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [15:10:49] 6837 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [15:11:14] !log homer ‘cr*eqiad’ commit "T383227" [15:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:17] T383227: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227 [15:11:50] (03CR) 10Majavah: [C:03+1] aptrepo: Add tofu package to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [15:13:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1020.eqiad.wmnet with OS bullseye [15:16:15] (03CR) 10Hnowlan: "This lgtm in theory if all that's desired is rerouting the base /v1/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190753 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [15:18:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host 
migrations - https://phabricator.wikimedia.org/T405647#11250698 (10RobH) a:05RobH→03klausman @klausman, Can you provide feedback on when we can migrate these hosts from one network port to the new network port?... [15:19:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11250715 (10cmooney) [15:20:06] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:21:26] (03PS1) 10Hashar: Disable component rather than motd plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194225 (https://phabricator.wikimedia.org/T387833) [15:21:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:22:43] (03CR) 10Hashar: [C:03+2] Disable component rather than motd plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194225 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [15:22:50] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:23:28] (03Merged) 10jenkins-bot: Disable component rather than motd plugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194225 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [15:23:30] that would fix the Gerrit popup "Timeout when loading plugins: wm-motd" [15:23:51] !log hashar@deploy2002 Started deploy [gerrit/gerrit@d0c47da]: Disable component rather than motd plugin [15:23:56] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [15:24:02] !log hashar@deploy2002 Finished deploy 
[gerrit/gerrit@d0c47da]: Disable component rather than motd plugin (duration: 00m 11s) [15:24:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:24:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:24:49] (03PS2) 10Federico Ceratto: migrate.py: MariaDB version migration cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [15:25:05] (03CR) 10Jgiannelos: osm: refactor swift scripts and make event-template dynamic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:25:14] (03CR) 10Jgiannelos: "Overall looks OK" [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:26:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:26:43] jasmine@cumin1003 decommission (PID 1692987) is awaiting input [15:29:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [15:29:18] !log jasmine@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl1001.eqiad.wmnet [15:31:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11250779 (10RobH) a:05KOfori→03Kappakayala @Kappakayala, This work is slated to start after Oct 15th and extend through the end of the month. **#data-persistenc... 
[15:32:02] (03CR) 10CI reject: [V:04-1] migrate.py: MariaDB version migration cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [15:32:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:32:26] jasmine@cumin1003 decommission (PID 1692987) is awaiting input [15:32:35] 10ops-codfw, 06DC-Ops, 10Wikidata, 10Wikidata-Platform, and 2 others: wdqs2017: Apparent hardware issue, rack C2 - https://phabricator.wikimedia.org/T406609 (10bking) 03NEW [15:32:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:33:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11250801 (10bking) Grabbing the ticket per IRC conversation with @Jhancock.wm [15:34:53] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:06] (03PS1) 10Bking: dse-k8s-worker2003: move back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1194231 (https://phabricator.wikimedia.org/T399778) [15:35:58] (03CR) 10Clément Goubert: [C:03+1] wikikube: decom control plane wikikube-ctrl1001 [puppet] - 10https://gerrit.wikimedia.org/r/1186006 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [15:36:29] (03CR) 10Bking: [C:03+2] dse-k8s-worker2003: move back to insetup 
[puppet] - 10https://gerrit.wikimedia.org/r/1194231 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking) [15:36:43] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1194231 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking) [15:36:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11250820 (10RobH) @BCornwall, I wanted to get your feedback on this as we start the migrations within a couple of weeks. With my understanding of the cp cluster, my proposal is to... [15:37:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:38:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [15:38:44] (03CR) 10Jasmine: [C:03+2] wikikube: decom control plane wikikube-ctrl1001 [puppet] - 10https://gerrit.wikimedia.org/r/1186006 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [15:38:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), and 2 others: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11250827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host dse-k8s-worker2003.cod... 
[15:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11250829 (10RobH) @btullis, This is by far the most detailed overview of the host lists provided so far, thank you! I'll review the above l... [15:40:26] !log jasmine@cumin1003 START - Cookbook sre.dns.netbox [15:40:35] (03CR) 10Slyngshede: [C:03+1] "LGTM, key verified out of band." [puppet] - 10https://gerrit.wikimedia.org/r/1194191 (owner: 10Elukey) [15:41:29] slyngs: Would you be willing to review and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192567 ? [15:42:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11250832 (10cmooney) >>! In T405623#11250820, @RobH wrote: > LVS: This two hosts are a bit more tricky as I'm guessing we need to fully depool an lvs host before we touch its network.... 
[15:42:25] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha1002.wikimedia.org with OS trixie [15:42:25] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha1002.wikimedia.org [15:43:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:43:31] (03CR) 10Elukey: osm: refactor swift scripts and make event-template dynamic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:43:46] (03CR) 10Elukey: [C:03+2] admin: add new ssh key for elukey [puppet] - 10https://gerrit.wikimedia.org/r/1194191 (owner: 10Elukey) [15:43:49] (03PS1) 10Herron: vopsbot: update systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1194234 [15:44:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11250842 (10RobH) @LSobanski, Is there anything else I can provide to assist in getting feedback on the host list in the task description for the... [15:44:44] dancy: Do we know if there's some exception for this? Normally sudo rules need to be approved at an IF team meeting. [15:45:07] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:46:09] jasmine@cumin1003 decommission (PID 1692987) is awaiting input [15:46:10] slyngs: These are fine [15:46:25] dancy: Then yes, just a few minutes [15:46:32] Thanks! 
[15:46:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11250846 (10RobH) [15:47:03] !log jasmine@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1003" [15:47:03] (03CR) 10Herron: [C:03+2] vopsbot: update systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1194234 (owner: 10Herron) [15:47:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11250847 (10RobH) [15:47:16] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11250848 (10RobH) [15:47:44] !log jasmine@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1003" [15:47:44] !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:45] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl1001.eqiad.wmnet [15:47:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11250849 (10RobH) [15:47:49] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11250850 (10RobH) [15:48:53] (03CR) 10Slyngshede: [C:03+1] Allow deployment group to sudo -u mwbuilder scap clean-images [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) (owner: 10Ahmon 
Dancy) [15:49:14] (03CR) 10Slyngshede: [C:03+2] Allow deployment group to sudo -u mwbuilder scap clean-images [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) (owner: 10Ahmon Dancy) [15:49:37] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [15:49:54] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker2003.codfw.wmnet with reason: host reimage [15:52:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [15:52:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2044.codfw.wmnet with OS bookworm [15:53:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11250904 (10LSobanski) a:05LSobanski→03cmooney @RobH here's a summary of what needs to happen with the hosts, @cmooney will be coordinating th... [15:53:40] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11250907 (10dancy) 05Open→03Resolved We now have a systemd timer which runs `scap clean-images` weekly. And users in the `deployment` group can now run `scap clean-images` successfully. [15:53:45] 06SRE, 06serviceops-radar, 06Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#11250909 (10dancy) 05Open→03Resolved a:03dancy We now have a systemd timer which runs `scap clean-images` weekly. And users in the `deployment` grou... 
[15:55:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:55:45] topranks: node has been removed, feel free to re-run the reverse DNS delegation script at any point, ty! [15:56:07] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1192616 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:56:10] (03CR) 10Scott French: [C:03+2] P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1192616 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:57:01] (03CR) 10Dzahn: [C:03+1] phabricator: delay pages by 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1194108 (https://phabricator.wikimedia.org/T406338) (owner: 10Jelto) [15:57:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2003.codfw.wmnet with reason: host reimage [15:58:03] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2044.codfw.wmnet with OS bullseye [15:58:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:59:55] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha1002.wikimedia.org [15:59:57] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:00:05] jhathaway and moritzm: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:15] (03CR) 10Scott French: [C:03+2] P:conftool::hiddenparma: enable known_client_expression_validation [puppet] - 10https://gerrit.wikimedia.org/r/1192620 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [16:01:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:01:35] dancy: Puppet is a little slow on the deploy hosts. deploy1003 should be good now and deploy2002 in a few minutes. [16:01:55] Awesome. Thanks for the help. [16:02:20] Anytime, I have to run, so hopefully it works out :-) [16:03:46] Farewell! [16:03:59] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:04:03] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha1002.wikimedia.org [16:04:11] yeah multiple IP assignments [16:04:35] bking@cumin2002 reimage (PID 516768) is awaiting input [16:05:36] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha1002.wikimedia.org [16:06:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:06:16] (03PS2) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [16:09:42] !log 
sukhe@cumin1003 START - Cookbook sre.dns.netbox [16:11:00] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1193958/7213/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:11:39] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2044.codfw.wmnet with OS bullseye [16:13:16] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [16:13:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [16:13:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:36] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha1002.wikimedia.org [16:13:42] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11251056 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `hcaptcha1002.wikimedia.org` - hcaptcha1002.wikimedia.org (... 
[16:15:13] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [16:15:37] (03PS3) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [16:18:19] bking@cumin2002 reimage (PID 529588) is awaiting input [16:20:42] (03CR) 10FNegri: [C:03+2] aptrepo: Add tofu package to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1194167 (https://phabricator.wikimedia.org/T405742) (owner: 10FNegri) [16:20:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:24:29] (03CR) 10Jgiannelos: osm: refactor swift scripts and make event-template dynamic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [16:25:47] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1020.eqiad.wmnet with OS bullseye [16:26:19] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1020.eqiad.wmnet with OS bullseye [16:28:30] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1193958/7214/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:30:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:39:02] 10ops-eqiad, 06DC-Ops, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs1020: Hanging during partitioning step of installation, rack E2 - https://phabricator.wikimedia.org/T406617 (10bking) 03NEW [16:39:58] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T406618 (10phaultfinder) 03NEW [16:40:50] 10ops-eqiad, 06DC-Ops, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs1020: Hanging during partitioning step of installation, rack E2 - https://phabricator.wikimedia.org/T406617#11251144 (10bking) [16:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:45:39] (03CR) 10JHathaway: sre.hardware.upgrade-firmware: fix ssd upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T1700) [17:00:59] 10ops-codfw, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: wdqs2017: Apparent hardware issue, rack C2 - https://phabricator.wikimedia.org/T406609#11251311 (10Jhancock.wm) a:03Jhancock.wm [17:04:50] !log releases2003 - re-enabling puppet - reacting to monitoring alert - T405352 [17:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:09] T405352: APT error when installing Jenkins package in releases instances - https://phabricator.wikimedia.org/T405352 [17:09:28] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406618#11251329 (10Jclark-ctr) a:03Jclark-ctr [17:09:59] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406618#11251332 (10phaultfinder) [17:10:26] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406618#11251336 (10Jclark-ctr) 05Open→03Resolved [17:10:56] 06SRE: nodesource 
node22 apt mirror is broken - https://phabricator.wikimedia.org/T406623 (10taavi) 03NEW [17:11:46] (03PS4) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:11:50] (03PS1) 10Majavah: aptrepo: Disable node22/Trixie nodesource mirror [puppet] - 10https://gerrit.wikimedia.org/r/1194252 (https://phabricator.wikimedia.org/T406623) [17:12:16] (03CR) 10CI reject: [V:04-1] aptrepo: Disable node22/Trixie nodesource mirror [puppet] - 10https://gerrit.wikimedia.org/r/1194252 (https://phabricator.wikimedia.org/T406623) (owner: 10Majavah) [17:12:22] (03PS2) 10Majavah: aptrepo: Disable node22/Trixie nodesource mirror [puppet] - 10https://gerrit.wikimedia.org/r/1194252 (https://phabricator.wikimedia.org/T406623) [17:13:19] (03CR) 10FNegri: [C:03+1] aptrepo: Disable node22/Trixie nodesource mirror [puppet] - 10https://gerrit.wikimedia.org/r/1194252 (https://phabricator.wikimedia.org/T406623) (owner: 10Majavah) [17:13:32] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:13:50] (03CR) 10Majavah: [C:03+2] aptrepo: Disable node22/Trixie nodesource mirror [puppet] - 10https://gerrit.wikimedia.org/r/1194252 (https://phabricator.wikimedia.org/T406623) (owner: 10Majavah) [17:15:47] (03PS5) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:17:05] bking@cumin2002 reimage (PID 554703) is awaiting input [17:19:26] (03PS1) 10Majavah: aptrepo: Remove node22/Trixie from the correct list [puppet] - 10https://gerrit.wikimedia.org/r/1194253 (https://phabricator.wikimedia.org/T406623) [17:20:17] (03CR) 10Majavah: [C:03+2] aptrepo: Remove node22/Trixie from the correct list [puppet] - 10https://gerrit.wikimedia.org/r/1194253 
(https://phabricator.wikimedia.org/T406623) (owner: 10Majavah) [17:21:33] (03PS4) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [17:22:02] (03CR) 10CI reject: [V:04-1] zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:24:49] (03PS5) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [17:26:41] !log taavi@apt1002 ~ $ sudo -i reprepro -C thirdparty/tofu update trixie-wikimedia # T405742 [17:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:45] T405742: tofu-provisioning: Failed to install provider - https://phabricator.wikimedia.org/T405742 [17:29:27] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha1002.wikimedia.org [17:29:29] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:29:58] (03PS6) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:30:52] (03PS6) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [17:32:54] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [17:33:08] dse-k8s-worker2003.mgmt.codfw.wmnet: [17:33:09] merging this [17:33:34] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [17:33:34] !log sukhe@cumin1003 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [17:33:34] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha1002.wikimedia.org on all recursors [17:33:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha1002.wikimedia.org on all recursors [17:34:05] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [17:34:09] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha1002.wikimedia.org - sukhe@cumin1003" [17:34:22] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1002.wikimedia.org with OS trixie [17:35:12] (03PS2) 10Scott French: mw-*: Tune 8.3 releases to prevent deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955) [17:35:12] (03PS1) 10Scott French: mw-*: Right-size large service after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194256 (https://phabricator.wikimedia.org/T405955) [17:37:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11251466 (10Jclark-ctr) @BTullis For an-test-master1002 we need to failover to it self when we move it or is that a typo? 
[17:39:52] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406626 (10phaultfinder) 03NEW [17:40:56] (03PS7) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [17:43:45] (03PS5) 10Jdlrobson: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) (owner: 10LorenMora) [17:44:03] (03PS2) 10Scott French: mw-*: Right-size large service after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194256 (https://phabricator.wikimedia.org/T405955) [17:46:28] 06SRE, 10Hiddenparma, 13Patch-For-Review: FY25/26 WE4.3.1: edge uniques in requestctl - https://phabricator.wikimedia.org/T400753#11251524 (10CDanis) 05Open→03Resolved [17:47:37] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1002.wikimedia.org with reason: host reimage [17:53:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1002.wikimedia.org with reason: host reimage [17:55:56] (03PS1) 10Gergő Tisza: session: Log actual class name in preventSessionsForUser exception [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194261 (https://phabricator.wikimedia.org/T406566) [17:56:46] (03PS8) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [17:57:10] (03PS1) 10Gergő Tisza: session: Log actual class name in preventSessionsForUser exception [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194262 (https://phabricator.wikimedia.org/T406566) [17:57:17] (03CR) 10CI reject: [V:04-1] zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) 
(owner: 10Dzahn) [17:58:16] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227#11251601 (10jasmine_) [17:58:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194261 (https://phabricator.wikimedia.org/T406566) (owner: 10Gergő Tisza) [17:58:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194262 (https://phabricator.wikimedia.org/T406566) (owner: 10Gergő Tisza) [17:59:05] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227#11251604 (10jasmine_) >>! In T383227#11251360, @Jclark-ctr wrote: > @Jasmin when you finish these... 
[18:02:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Do something with cloudcontrol100[8-10]-dev - https://phabricator.wikimedia.org/T406630 (10Andrew) 03NEW [18:07:08] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194263 [18:08:57] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1002.wikimedia.org with OS trixie [18:08:58] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha1002.wikimedia.org [18:14:51] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406626#11251671 (10phaultfinder) [18:20:47] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:24:55] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406626#11251711 (10phaultfinder) [18:24:58] (03PS3) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [18:32:30] 06SRE, 06Data-Engineering: Set up a working, usable dbt installation on stat boxes - https://phabricator.wikimedia.org/T406634 (10amastilovic) 03NEW [18:39:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [18:44:52] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406626#11251768 (10phaultfinder) [18:51:56] !incidents [18:51:56] 6839 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [18:51:56] 6837 (RESOLVED) 
TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [18:58:03] (03PS15) 10Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) [18:58:39] (03CR) 10Aude: [C:04-1] "do we need the User-Agent in the stream?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) (owner: 10LorenMora) [19:00:18] (03CR) 10Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [19:17:01] (03PS1) 10Pcoombe: Disable mobilefrontend on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194278 (https://phabricator.wikimedia.org/T406638) [19:23:21] (03CR) 10Muehlenhoff: haptcha: add new role for hCaptcha proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [19:28:20] (03PS4) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [19:28:34] (03CR) 10Ssingh: haptcha: add new role for hCaptcha proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [19:36:40] (03CR) 10Jdlrobson: [C:03+1] Disable mobilefrontend on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194278 (https://phabricator.wikimedia.org/T406638) (owner: 10Pcoombe) [19:39:37] (03PS16) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [19:40:14] (03CR) 10Muehlenhoff: haptcha: add new role for hCaptcha proxy (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [19:41:41] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1020.eqiad.wmnet with reason: host reimage [19:41:54] 06SRE, 10Wikidata, 10Wikidata-Query-Service, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: wdqs1020: Hanging during partitioning step of installation, rack E2 - https://phabricator.wikimedia.org/T406617#11251948 (10bking) 05Open→03Invalid p:05Triage→03Low a:03bking [19:45:11] 06SRE, 10Wikidata, 10Wikidata-Query-Service, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: wdqs1020: Hanging during partitioning step of installation, rack E2 - https://phabricator.wikimedia.org/T406617#11251960 (10bking) I left the install going for a few hours, and it eventually... [19:47:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1020.eqiad.wmnet with reason: host reimage [19:54:55] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406626#11252010 (10phaultfinder) [19:58:56] (03PS1) 10Gergő Tisza: session: Log cache write flags in `SessionStore::set()` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) [19:59:03] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:59:23] (03PS1) 10Gergő Tisza: session: Log cache write flags in `SessionStore::set()` [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194282 (https://phabricator.wikimedia.org/T405633) [19:59:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] 
(wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [19:59:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194282 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T2000) [20:00:05] maryum, tgr, and AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:31] is anyone going to do a mediawiki config deployment? I am planning to do my deployment with spiderpig [20:02:44] spiderpig running [20:02:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193941 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [20:03:48] (03Merged) 10jenkins-bot: OATHAuth: Increase 2FA opt-in to 40% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193941 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [20:04:24] !log mstyles@deploy2002 Started scap sync-world: Backport for [[gerrit:1193941|OATHAuth: Increase 2FA opt-in to 40% of users (T399664)]] [20:04:28] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [20:04:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1020.eqiad.wmnet with OS bullseye [20:05:19] * AaronSchulz has one mw config patch [20:08:42] !log mstyles@deploy2002 mstyles: Backport for [[gerrit:1193941|OATHAuth: Increase 2FA opt-in to 40% of 
users (T399664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:09:11] !log mstyles@deploy2002 mstyles: Continuing with sync [20:10:12] (03PS1) 10JHathaway: postfix: bump module to v3.1.6 [puppet] - 10https://gerrit.wikimedia.org/r/1194285 (https://phabricator.wikimedia.org/T406278) [20:11:47] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194285 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [20:12:42] I'm going to add a config patch to this window [20:13:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227#11252077 (10Jclark-ctr) a:03Jclark-ctr [20:13:32] !log mstyles@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193941|OATHAuth: Increase 2FA opt-in to 40% of users (T399664)]] (duration: 09m 08s) [20:13:54] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [20:14:03] (03CR) 10CI reject: [V:04-1] session: Log cache write flags in `SessionStore::set()` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [20:15:26] (03PS1) 10Bking: wdqs1020: move back to production role [puppet] - 10https://gerrit.wikimedia.org/r/1194286 (https://phabricator.wikimedia.org/T405978) [20:16:02] (03PS1) 10Kosta Harlan: CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194287 [20:16:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194287 (owner: 10Kosta Harlan) [20:17:36] (03CR) 10JHathaway: [C:03+2] postfix: bump module to v3.1.6 [puppet] - 
10https://gerrit.wikimedia.org/r/1194285 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [20:20:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194286 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [20:20:55] AaronSchulz, feel free to deploy, I need to deal with some bogus CI issue [20:22:06] (03CR) 10Gergő Tisza: "`" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [20:22:14] (03CR) 10Gergő Tisza: "recheck" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [20:23:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194278 (https://phabricator.wikimedia.org/T406638) (owner: 10Pcoombe) [20:26:28] tgr_: I have a config patch as well, can I sync that before your wmf.21 patches? [20:26:53] sure [20:29:22] (03PS2) 10Bking: wdqs1020: move back to production role [puppet] - 10https://gerrit.wikimedia.org/r/1194286 (https://phabricator.wikimedia.org/T405978) [20:30:25] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time." [puppet] - 10https://gerrit.wikimedia.org/r/1194286 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [20:30:31] (03PS2) 10Kosta Harlan: CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194287 (https://phabricator.wikimedia.org/T405342) [20:31:03] (03PS3) 10Kosta Harlan: CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194287 (https://phabricator.wikimedia.org/T405342) [20:31:22] AaronSchulz: are you deploying, or shall I go? [20:31:44] kostajh: you can. 
I always brushing up on the commands. [20:31:48] ok [20:31:53] s/always/was [20:32:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194287 (https://phabricator.wikimedia.org/T405342) (owner: 10Kosta Harlan) [20:33:22] (03Merged) 10jenkins-bot: CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194287 (https://phabricator.wikimedia.org/T405342) (owner: 10Kosta Harlan) [20:33:59] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194287|CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki (T405342)]] [20:34:02] T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 [20:37:54] kostajh: let me know when you are done [20:38:20] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194287|CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki (T405342)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:39:31] will do [20:40:45] !log kharlan@deploy2002 kharlan: Continuing with sync [20:41:52] !log Enable unified mobile routing on all except en.wikipedia.org - T403510 [20:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:55] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [20:43:40] (03PS1) 10Ladsgroup: maintain-views: Add abusefilterblockeddomainhit to allowed log types [puppet] - 10https://gerrit.wikimedia.org/r/1194294 (https://phabricator.wikimedia.org/T406562) [20:45:03] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194287|CheckUser/UserInfoCard: Remove enable-by-default mode for dewiki (T405342)]] (duration: 11m 05s) [20:45:07] T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 [20:45:41] AaronSchulz: over to you [20:47:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [20:48:07] (03Merged) 10jenkins-bot: Add restbase spec JSON files to which /rest_v1/?spec can be routed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [20:48:16] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer main graph to newly-reimaged host) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:48:35] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1175942|Add restbase spec JSON files to which /rest_v1/?spec can be routed (T397203 T396805)]] [20:48:42] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [20:48:48] T397203: [SPIKE] Propose a location for hosting 
RESTBase OpenAPI spec definitions - https://phabricator.wikimedia.org/T397203 [20:48:48] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [20:50:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11252319 (10wiki_willy) a:05cmooney→03VRiley-WMF [20:50:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [20:50:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [20:50:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11252322 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host dse-k8s-worker2003.c... [20:50:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11252323 (10wiki_willy) a:05cmooney→03VRiley-WMF [20:53:06] !log aaron@deploy2002 aaron: Backport for [[gerrit:1175942|Add restbase spec JSON files to which /rest_v1/?spec can be routed (T397203 T396805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:54:21] (03PS1) 10JHathaway: mx_outbound_hosts: helper function [puppet] - 10https://gerrit.wikimedia.org/r/1194297 [20:54:29] !log aaron@deploy2002 aaron: Continuing with sync [20:57:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11252329 (10cmooney) >>! 
In T406554#11250521, @VRiley-WMF wrote: > Yes, it seems like there is an issue with the fan, it is showing the warning... [20:58:49] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1175942|Add restbase spec JSON files to which /rest_v1/?spec can be routed (T397203 T396805)]] (duration: 10m 13s) [20:58:54] T397203: [SPIKE] Propose a location for hosting RESTBase OpenAPI spec definitions - https://phabricator.wikimedia.org/T397203 [20:58:54] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [20:58:59] (03CR) 10JHathaway: [C:03+2] mx_outbound_hosts: helper function [puppet] - 10https://gerrit.wikimedia.org/r/1194297 (owner: 10JHathaway) [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251007T2100) [21:03:22] (03PS1) 10JHathaway: civicrm: set postfix relay host to wikimedia's mx-out [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) [21:03:52] I'm done with my deploy. [21:04:25] will wait a few minutes in case web team wants to use their window [21:04:26] gerrit acting up for me now though. 
[21:06:35] (03CR) 10CI reject: [V:04-1] civicrm: set postfix relay host to wikimedia's mx-out [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [21:08:51] (03PS2) 10JHathaway: civicrm: set postfix relay host to wikimedia's mx-out [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) [21:09:06] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [21:11:09] (03CR) 10CI reject: [V:04-1] civicrm: set postfix relay host to wikimedia's mx-out [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [21:12:45] (03CR) 10Tacsipacsi: Add a banner for a Gerrit switch over maintenance (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [21:13:44] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:13:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194261 (https://phabricator.wikimedia.org/T406566) (owner: 10Gergő Tisza) [21:13:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194262 (https://phabricator.wikimedia.org/T406566) (owner: 10Gergő Tisza) [21:13:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [21:13:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] 
(wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194282 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [21:14:37] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:15:15] (03PS4) 10Aaron Schulz: [DNM] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) [21:15:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656 (10bking) 03NEW [21:16:42] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:16:43] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:17:38] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:17:45] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:17:51] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:18:39] (03PS5) 10Aaron Schulz: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) [21:18:45] (03PS3) 10JHathaway: civicrm: set postfix relay host to wikimedia's mx-out [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) [21:18:52] (03Merged) 10jenkins-bot: session: Log actual class name in preventSessionsForUser exception [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194261 (https://phabricator.wikimedia.org/T406566) (owner: 10Gergő Tisza) [21:18:58] 
(03Merged) 10jenkins-bot: session: Log actual class name in preventSessionsForUser exception [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194262 (https://phabricator.wikimedia.org/T406566) (owner: 10Gergő Tisza) [21:20:24] (03PS5) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [21:21:03] (03CR) 10Ssingh: haptcha: add new role for hCaptcha proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [21:21:28] (03PS6) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [21:21:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [21:22:18] I need to do a small beta cluster deployment [21:22:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:24:10] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021#11252411 (10RobH) p:05Triage→03Medium a:05RobH→03Jclark-ctr [21:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021#11252414 (10RobH) @Jclark-ctr, This sretest1002 located in D6 can be the first 'test' of the migration scripts since its a test host and all. 
[21:25:09] (03PS6) 10Aaron Schulz: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) [21:25:55] Jdlrobson: the backports will take a bit, sorry [21:26:44] tgr_: no prob [21:27:00] (03PS1) 10Jdlrobson: Reconfigure labs survey to test embedElementId [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194301 (https://phabricator.wikimedia.org/T404152) [21:27:18] ^ that's the patch tgr_ . let me know when I can have the conch :) [21:28:07] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:29:59] (03Merged) 10jenkins-bot: session: Log cache write flags in `SessionStore::set()` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194281 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [21:30:06] (03Merged) 10jenkins-bot: session: Log cache write flags in `SessionStore::set()` [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194282 (https://phabricator.wikimedia.org/T405633) (owner: 10Gergő Tisza) [21:30:44] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1194261|session: Log actual class name in preventSessionsForUser exception (T406566)]], [[gerrit:1194262|session: Log actual class name in preventSessionsForUser exception (T406566)]], [[gerrit:1194281|session: Log cache write flags in `SessionStore::set()` (T405633 T405634)]], [[gerrit:1194282|session: Log cache write flags in `SessionStore::set()` (T4056 [21:30:44] 33 T405634)]] [21:30:54] T406566: BadMethodCallException: MediaWiki\Session\SessionProvider::preventSessionsForUser must be implemented when canChangeUser() is false - https://phabricator.wikimedia.org/T406566 [21:30:54] T405633: Session data is authenticated, should not be an anonymous user - https://phabricator.wikimedia.org/T405633 [21:30:54] T405634: Authenticated data should not be in the anonymous store - 
https://phabricator.wikimedia.org/T405634 [21:30:55] T4056: Link to media file missing from description page in 1.5 - https://phabricator.wikimedia.org/T4056 [21:33:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11252463 (10VRiley-WMF) Hey @cmooney I just checked the filter, and it looked clean. I also reseated the fans as well, however it still is showi... [21:34:47] !log tgr@deploy2002 tgr: Backport for [[gerrit:1194261|session: Log actual class name in preventSessionsForUser exception (T406566)]], [[gerrit:1194262|session: Log actual class name in preventSessionsForUser exception (T406566)]], [[gerrit:1194281|session: Log cache write flags in `SessionStore::set()` (T405633 T405634)]], [[gerrit:1194282|session: Log cache write flags in `SessionStore::set()` (T405633 T405634)]] synced [21:34:47] to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:36:01] !log tgr@deploy2002 tgr: Continuing with sync [21:40:21] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194261|session: Log actual class name in preventSessionsForUser exception (T406566)]], [[gerrit:1194262|session: Log actual class name in preventSessionsForUser exception (T406566)]], [[gerrit:1194281|session: Log cache write flags in `SessionStore::set()` (T405633 T405634)]], [[gerrit:1194282|session: Log cache write flags in `SessionStore::set()` (T405 [21:40:21] 633 T405634)]] (duration: 09m 36s) [21:40:29] T406566: BadMethodCallException: MediaWiki\Session\SessionProvider::preventSessionsForUser must be implemented when canChangeUser() is false - https://phabricator.wikimedia.org/T406566 [21:40:30] T405633: Session data is authenticated, should not be an anonymous user - https://phabricator.wikimedia.org/T405633 [21:40:30] T405634: Authenticated data should not be in the anonymous store - https://phabricator.wikimedia.org/T405634 [21:40:31] T405: USER GROUP: Understanding and Build Up an Ecosystem - Compile List of Third Party Contacts - https://phabricator.wikimedia.org/T405 [21:40:34] Jdlrobson: I'm done [21:40:46] !log UTC late deploys done [21:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:34] thanks tgr_ [21:41:47] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T405978, transfer main graph to newly-reimaged host) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1020.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [21:41:50] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [21:41:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194301 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson) [21:42:32] PROBLEM - Blazegraph Port for wdqs-categories 
on wdqs1020 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:42:32] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1020 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:42:32] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:42:32] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:42:57] well.. that looks like a common thing we already know. let me restart blazegraph there [21:43:26] (03Merged) 10jenkins-bot: Reconfigure labs survey to test embedElementId [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194301 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson) [21:44:44] mutante no worries, I Got it [21:45:05] inflatador: I tried but "wdqs-blazegraph.service: Start request repeated too quickly." [21:45:14] then I checked https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Resolving_blazegraph_deadlock [21:45:34] mutante yeah, sorry, I should have downtimed that guy ;( I just finished the data transfer and it needs a couple more steps [21:45:37] and it says to restart but not the full command [21:45:54] ah! gotcha. I thought it was a random crash [21:46:10] with blazegraph? 
Nah ;P [21:46:16] ;) [21:46:25] this is like the one time it's not actually a deadlock :P 99 times out of 100 your approach is perfect [21:46:39] haha, ok:) got you [21:47:14] (done) [21:48:42] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on wdqs1020.eqiad.wmnet with reason: finish getting host ready for production [21:49:45] !log bking@deploy2002 Started deploy [wdqs/wdqs@fea7794]: T405978 [21:49:48] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [21:50:29] !log bking@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: T405978 (duration: 00m 45s) [21:50:32] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1020 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:50:32] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:51:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:32] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1020 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:52:32] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:55:48] (03PS11) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 
10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [21:56:21] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://phabricator.wikimedia.org/T394844#10888302" [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:00:41] (03PS1) 10Jdlrobson: [beta] Allow displaying surveys on special pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194303 (https://phabricator.wikimedia.org/T404152) [22:01:06] sorry i missed something in above [22:01:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194303 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson) [22:02:32] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=wdqs1020\.eqiad\.wmnet [22:03:03] (03Merged) 10jenkins-bot: [beta] Allow displaying surveys on special pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194303 (https://phabricator.wikimedia.org/T404152) (owner: 10Jdlrobson) [22:04:05] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye [22:06:00] 06SRE, 06collaboration-services, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q1): create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#11252624 (10Dzahn) [22:08:05] (03PS1) 10Jdlrobson: [labs] Move namespaces to audience definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) [22:09:53] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:11:48] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"running per cookbook error suggestion - bking@cumin2002 - T399778" [22:11:52] T399778: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778 [22:12:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "running per cookbook error suggestion - bking@cumin2002 - T399778" [22:15:47] (03PS1) 10Bking: dse-k8s-worker2003: return to production role [puppet] - 10https://gerrit.wikimedia.org/r/1194305 (https://phabricator.wikimedia.org/T399778) [22:17:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), and 2 others: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11252650 (10bking) I verified that this host is ready for production, so no need to worry about the above cookbook failure. [22:23:29] (03PS1) 10Dzahn: zuul: create class and systemd unit for new zuul-web service [puppet] - 10https://gerrit.wikimedia.org/r/1194306 (https://phabricator.wikimedia.org/T395938) [22:24:52] (03PS2) 10Dzahn: zuul: create class and systemd unit for new zuul-web service [puppet] - 10https://gerrit.wikimedia.org/r/1194306 (https://phabricator.wikimedia.org/T395938) [22:25:40] 10ops-esams, 06SRE, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601#11252670 (10RobH) 05Open→03Resolved [22:35:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [22:36:16] (03CR) 10Dwisehaupt: [C:03+1] "I think this looks good. Tested in cloudvps and it does the right thing with an undef relayhost there." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [22:36:30] (03PS1) 10Jdlrobson: Enable QuickSurveys on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194309 (https://phabricator.wikimedia.org/T317841) [22:46:14] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [22:55:00] ryankemper@cumin2002 reimage (PID 722343) is awaiting input [23:09:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:09:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:10:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:13:54] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [23:14:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:14:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:18:48] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:19:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:20:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:22:48] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[23:24:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:24:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1
(Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:36:35] (PS4) Aaron Schulz: Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203)
[23:38:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1194314
[23:38:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1194314 (owner: TrainBranchBot)
[23:45:51] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:47:46] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:48:37] Hey all - need to do an emergency security deployment real quick for T406664
[23:50:54] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:51:42] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1194314 (owner: TrainBranchBot)
[23:52:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:53:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:53:49] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:54:06] (CR) BPirkle: [C:-1] "One escaping change, otherwise looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: Aaron Schulz)
[23:54:47] !log
amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:57:07] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:57:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:57:47] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:58:17] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[23:58:51] !log Deployed security mitigation for T406664 to 1.45.0-wmf.21
[23:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log