[00:09:01] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:09:48] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:11:44] ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T381454 (phaultfinder) NEW
[00:13:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[00:13:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1084.eqiad.wmnet with OS bullseye
[00:13:13] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378082 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-be1084.eqiad.wmnet with OS bullseye complete...
[00:16:46] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:18:15] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1085.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:18:45] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1085.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:22:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:26:57] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:28:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:30:07] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[00:30:10] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150
[00:31:12] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1085.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:36:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1042.eqiad.wmnet with reason: host reimage
[00:37:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:38:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100216
[00:38:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100216 (owner: TrainBranchBot)
[00:40:11]
!log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1042.eqiad.wmnet with reason: host reimage
[00:41:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1085.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:12] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:50] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS13030/IPv4: Idle - Init7, AS13030/IPv6: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:43:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:43:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:45:05] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378111 (VRiley-WMF)
[00:47:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:47:53] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1085.eqiad.wmnet with OS bullseye
[00:48:04] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378117 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-be1085.eqiad.wmnet with OS bullseye
[00:48:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:48:29]
!log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:50:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:51:40] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:52:02] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:52:51] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:53:19] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:54:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100216 (owner: TrainBranchBot)
[00:54:16] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378121 (VRiley-WMF)
[00:55:16] (PS1) Tim Starling: Prepare for migration of the Interwiki extension to core [mediawiki-config] - https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951)
[00:56:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:57:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:57:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1042.eqiad.wmnet with OS bookworm
[00:57:44]
ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1042.eqiad.wmnet with OS bookworm co...
[01:00:07] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:00:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:01:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:02:50] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 67, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:02:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1041.eqiad.wmnet with OS bookworm
[01:02:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:00] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378128 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm ex...
[01:03:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1041.eqiad.wmnet with OS bookworm
[01:03:16] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378129 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm
[01:07:37] ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173#10378133 (VRiley-WMF)
[01:08:18] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100219
[01:08:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100219 (owner: TrainBranchBot)
[01:15:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:15:52] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:19:21] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1041.eqiad.wmnet with reason: host reimage
[01:20:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1045.eqiad.wmnet with OS bookworm
[01:20:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1046.eqiad.wmnet with OS bookworm
[01:20:57] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378136 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm
[01:21:04] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378137
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1046.eqiad.wmnet with OS bookworm
[01:22:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1041.eqiad.wmnet with reason: host reimage
[01:22:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:23:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[01:24:11] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T381454#10378140 (VRiley-WMF) a:VRiley-WMF
[01:24:13] ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: Decommission mc-gp100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T381173#10378138 (VRiley-WMF) Open→Resolved
[01:25:44] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100219 (owner: TrainBranchBot)
[01:28:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:28:58] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10378145 (VRiley-WMF) a:VRiley-WMF
[01:36:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1046.eqiad.wmnet with reason: host reimage
[01:36:58] PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:37:02] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:39:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1046.eqiad.wmnet with reason: host
reimage
[01:39:58] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[01:42:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1044.eqiad.wmnet with OS bookworm
[01:42:29] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378147 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm ex...
[01:46:08] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[01:46:17] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378150 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm ex...
[01:52:54] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T381454#10378151 (VRiley-WMF) Open→Resolved Reseated Power supply
[01:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:56:33] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[02:01:42] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1012 / ganeti1022 - https://phabricator.wikimedia.org/T381385#10378156 (VRiley-WMF)
[02:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:07:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:08:09] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1085.eqiad.wmnet with OS bullseye
[02:08:14] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378157 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-be1085.eqiad.wmnet with OS bullseye executed...
[02:09:25] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1012 / ganeti1022 - https://phabricator.wikimedia.org/T381385#10378158 (VRiley-WMF)
[02:09:32] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1012 / ganeti1022 - https://phabricator.wikimedia.org/T381385#10378159 (VRiley-WMF) Open→Resolved
[02:12:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:14:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10378160 (phaultfinder)
[02:18:43] ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268#10378161 (VRiley-WMF)
[02:32:00] ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268#10378169 (VRiley-WMF)
[02:32:16] ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381268#10378170 (VRiley-WMF) Open→Resolved
[02:32:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[02:32:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1041.eqiad.wmnet with OS bookworm
[02:32:55] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378172 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm
co...
[02:33:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[02:33:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1046.eqiad.wmnet with OS bookworm
[02:33:41] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378173 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1046.eqiad.wmnet with OS bookworm co...
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:50] (PS4) Srishakatux: Add new namespaces to hsb wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634)
[02:40:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1045.eqiad.wmnet with OS bookworm
[02:41:11] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10378174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm ex...
[02:42:02] RECOVERY - BFD status on cr2-esams is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:42:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:44:06] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:45:10] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:54:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:55:10] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10378186 (VRiley-WMF)
[02:56:04] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1015 / ganeti1021 - https://phabricator.wikimedia.org/T381157#10378187 (VRiley-WMF) Open→Resolved
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:22:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:30:10] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:52:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:54:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit -
https://phabricator.wikimedia.org/T381230#10378202 (phaultfinder)
[04:02:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:15:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:18:52] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:19:02] RECOVERY - BFD status on cr1-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:19:12] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 398710024 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:20:12] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 12072 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:31:26] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:38:06] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:38:10] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:38:10] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:38:12] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:54:28] FIRING: [3x]
SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 10%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71507 and previous config saved to /var/cache/conftool/dbconfig/20241204-060808-root.json
[06:08:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 10%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71508 and previous config saved to /var/cache/conftool/dbconfig/20241204-060834-root.json
[06:10:26] (PS1) Marostegui: instances.yaml: Add es2042 [puppet] - https://gerrit.wikimedia.org/r/1100232 (https://phabricator.wikimedia.org/T381259)
[06:16:22] (CR) Marostegui: [C:+2] instances.yaml: Add es2042 [puppet] - https://gerrit.wikimedia.org/r/1100232 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[06:18:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2042 to dbctl depooled T381259', diff saved to https://phabricator.wikimedia.org/P71509 and previous config saved to /var/cache/conftool/dbconfig/20241204-061821-marostegui.json
[06:18:25] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259
[06:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 25%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71510 and previous config saved to /var/cache/conftool/dbconfig/20241204-062313-root.json
[06:23:40] !log
marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71511 and previous config saved to /var/cache/conftool/dbconfig/20241204-062339-root.json
[06:31:20] (PS2) Abijeet Patro: Translate: Enable message group subscription for 6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386)
[06:38:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71512 and previous config saved to /var/cache/conftool/dbconfig/20241204-063819-root.json
[06:38:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71513 and previous config saved to /var/cache/conftool/dbconfig/20241204-063844-root.json
[06:44:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[06:51:51] (CR) Arnaudb: "convolutions flattened, one question still open" [software/spicerack] - https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: Arnaudb)
[06:53:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71514 and previous config saved to /var/cache/conftool/dbconfig/20241204-065324-root.json
[06:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71515 and previous config saved to /var/cache/conftool/dbconfig/20241204-065349-root.json
[06:56:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP
CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T0700)
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:08:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71516 and previous config saved to /var/cache/conftool/dbconfig/20241204-070829-root.json
[07:08:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71517 and previous config saved to /var/cache/conftool/dbconfig/20241204-070855-root.json
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:31:40] (CR) KCVelaga: Add Metrics Platform stream configuration for translate_extension (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[07:35:20] (CR) Slyngshede: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1100132 (owner: Muehlenhoff)
[07:35:46] (PS4) Wangombe: Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1097499
(https://phabricator.wikimedia.org/T364460)
[07:36:04] (CR) Wangombe: Add Metrics Platform stream configuration for translate_extension (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[07:36:42] (CR) Slyngshede: [C:-1] Extend access request email template (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1100133 (owner: Muehlenhoff)
[07:37:46] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T381464 (Cpetrillo) NEW
[07:45:03] (PS1) Marostegui: es2042: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100387
[07:46:02] (PS1) Slyngshede: Updated notification handling [software/bitu] - https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075)
[07:46:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71518 and previous config saved to /var/cache/conftool/dbconfig/20241204-074629-root.json
[07:46:34] (PS1) Jelto: trafficserver: switch query-scholarly to wikikube [puppet] - https://gerrit.wikimedia.org/r/1100389 (https://phabricator.wikimedia.org/T350793)
[07:46:35] (CR) Marostegui: [C:+2] es2042: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100387 (owner: Marostegui)
[07:46:43] (CR) Slyngshede: [C:+2] Fix typo in SUL reminder [software/bitu] - https://gerrit.wikimedia.org/r/1100132 (owner: Muehlenhoff)
[07:47:03] (CR) Jelto: [C:+2] "re-revert: Ic67e16343ecb4deb58e9ba2019af0468bf99e13a" [puppet] - https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[07:48:46] (PS1) Marostegui: instances.yaml: Add es2046 [puppet] - https://gerrit.wikimedia.org/r/1100390 (https://phabricator.wikimedia.org/T381259)
[07:49:16] (Merged) jenkins-bot: Fix typo in SUL reminder
[software/bitu] - https://gerrit.wikimedia.org/r/1100132 (owner: Muehlenhoff)
[07:51:13] (CR) Muehlenhoff: New ferm rule to permit HDFS data flows and mark as low-prio for qos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: Cathal Mooney)
[07:52:16] (CR) Marostegui: [C:+2] instances.yaml: Add es2046 [puppet] - https://gerrit.wikimedia.org/r/1100390 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[07:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2046 to es5 depooled T381259', diff saved to https://phabricator.wikimedia.org/P71519 and previous config saved to /var/cache/conftool/dbconfig/20241204-075427-marostegui.json
[07:54:31] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259
[07:55:24] (PS1) Marostegui: es2046: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100391 (https://phabricator.wikimedia.org/T381259)
[07:56:09] (CR) Marostegui: [C:+2] es2046: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100391 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[07:57:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 1%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71520 and previous config saved to /var/cache/conftool/dbconfig/20241204-075703-root.json
[07:58:15] (PS1) Slyngshede: Release v0.1.3 [software/bitu] - https://gerrit.wikimedia.org/r/1100393
[07:58:25] (CR) Jelto: [C:+2] trafficserver: switch query-scholarly to wikikube [puppet] - https://gerrit.wikimedia.org/r/1100389 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[07:59:19] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1100393 (owner: Slyngshede)
[08:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T0800). [08:00:04] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:09] hello [08:00:26] I'll deploy [08:00:54] (03CR) 10Slyngshede: [C:03+2] Release v0.1.3 [software/bitu] - 10https://gerrit.wikimedia.org/r/1100393 (owner: 10Slyngshede) [08:01:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100117 (https://phabricator.wikimedia.org/T381189) (owner: 10Kosta Harlan) [08:01:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 25%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71522 and previous config saved to /var/cache/conftool/dbconfig/20241204-080134-root.json [08:03:37] (03Merged) 10jenkins-bot: Release v0.1.3 [software/bitu] - 10https://gerrit.wikimedia.org/r/1100393 (owner: 10Slyngshede) [08:05:13] (03PS16) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [08:05:13] (03CR) 10Arnaudb: "tests are written down" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [08:11:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:12:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71524 and previous config saved to /var/cache/conftool/dbconfig/20241204-081208-root.json [08:12:11] (03Merged) 10jenkins-bot: dialog: Don't duplicate the footer in the behaviour list template 
[extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100117 (https://phabricator.wikimedia.org/T381189) (owner: 10Kosta Harlan) [08:13:18] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1100117|dialog: Don't duplicate the footer in the behaviour list template (T381189)]] [08:13:20] T381189: Footer text on types of unacceptable behavior step is not in dialog footer - https://phabricator.wikimedia.org/T381189 [08:13:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet [08:14:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10378493 (10ops-monitoring-bot) Draining ganeti2018.codfw.wmnet of running VMs [08:16:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 50%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71525 and previous config saved to /var/cache/conftool/dbconfig/20241204-081640-root.json [08:18:11] (03PS1) 10Slyngshede: Switch to upgraded Bitu node [dns] - 10https://gerrit.wikimedia.org/r/1100395 [08:18:27] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1100117|dialog: Don't duplicate the footer in the behaviour list template (T381189)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:18:30] T381189: Footer text on types of unacceptable behavior step is not in dialog footer - https://phabricator.wikimedia.org/T381189 [08:18:42] (03CR) 10DCausse: [C:04-1] "Will consider using the versioned stream conventions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [08:18:55] !log kharlan@deploy2002 kharlan: Continuing with sync [08:20:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet [08:21:18] 10SRE-swift-storage, 
10Observability-Metrics: Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#10378504 (10tappof) [08:21:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:23:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet [08:23:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10378507 (10ops-monitoring-bot) Draining ganeti2018.codfw.wmnet of running VMs [08:25:26] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100117|dialog: Don't duplicate the footer in the behaviour list template (T381189)]] (duration: 12m 08s) [08:25:28] T381189: Footer text on types of unacceptable behavior step is not in dialog footer - https://phabricator.wikimedia.org/T381189 [08:27:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 25%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71526 and previous config saved to /var/cache/conftool/dbconfig/20241204-082714-root.json [08:29:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1100395 (owner: 10Slyngshede) [08:31:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71527 and previous config saved to /var/cache/conftool/dbconfig/20241204-083145-root.json [08:35:18] !log rebalance Ganeti eqiad/C following server refreshes [08:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:31] (03CR) 10Slyngshede: [C:03+2] Switch to upgraded Bitu node [dns] - 10https://gerrit.wikimedia.org/r/1100395 (owner: 10Slyngshede) [08:42:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 50%: Pooling in es5', diff 
saved to https://phabricator.wikimedia.org/P71528 and previous config saved to /var/cache/conftool/dbconfig/20241204-084219-root.json [08:42:51] (03Abandoned) 10Gehel: java: introduce a standard list of GC logging options for Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/954060 (https://phabricator.wikimedia.org/T345355) (owner: 10Gehel) [08:43:56] (03CR) 10Gehel: "Oh, I see! There are accesses to top level variables in statistics::published. That's confusing!" [puppet] - 10https://gerrit.wikimedia.org/r/924946 (owner: 10Gehel) [08:46:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71529 and previous config saved to /var/cache/conftool/dbconfig/20241204-084650-root.json [08:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2020 to es4 master T381259', diff saved to https://phabricator.wikimedia.org/P71530 and previous config saved to /var/cache/conftool/dbconfig/20241204-085124-marostegui.json [08:51:28] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [08:51:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2022 to clone es2043', diff saved to https://phabricator.wikimedia.org/P71531 and previous config saved to /var/cache/conftool/dbconfig/20241204-085143-marostegui.json [08:51:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2022.codfw.wmnet with reason: cloning [08:52:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2022.codfw.wmnet with reason: cloning [08:54:29] (03PS1) 10Marostegui: mariadb: Productionize es2043 [puppet] - 10https://gerrit.wikimedia.org/r/1100399 (https://phabricator.wikimedia.org/T381259) [08:55:36] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2043 [puppet] - 10https://gerrit.wikimedia.org/r/1100399 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) 
[08:57:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71532 and previous config saved to /var/cache/conftool/dbconfig/20241204-085724-root.json [09:01:01] (03CR) 10JMeybohm: [C:03+1] mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [09:02:34] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 317, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:05:43] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2440,2442-2444].codfw.wmnet [09:07:57] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2440,2442-2444].codfw.wmnet [09:12:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:12:22] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[2440,2442-2444].codfw.wmnet with reason: T377877 [09:12:24] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:12:25] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [09:12:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71533 and previous config saved to /var/cache/conftool/dbconfig/20241204-091229-root.json [09:12:43] !log 
jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[2440,2442-2444].codfw.wmnet with reason: T377877 [09:13:12] ACKNOWLEDGEMENT - MD RAID on mw2440 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T381469 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:13:20] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2440 - https://phabricator.wikimedia.org/T381469 (10ops-monitoring-bot) 03NEW [09:14:51] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2440.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:15:31] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2442.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:21:31] !log brouberol@cumin2002 START - Cookbook sre.hosts.decommission for hosts an-presto1001.eqiad.wmnet [09:24:30] (03PS1) 10Brouberol: aliases: change the an-presto-canary host [puppet] - 10https://gerrit.wikimedia.org/r/1100400 (https://phabricator.wikimedia.org/T381407) [09:26:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1100400 (https://phabricator.wikimedia.org/T381407) (owner: 10Brouberol) [09:28:18] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 234, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:28:28] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 317, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:29:15] (03PS3) 10Cathal Mooney: New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) [09:29:43] (03CR) 10Brouberol: [C:03+2] aliases: 
change the an-presto-canary host [puppet] - 10https://gerrit.wikimedia.org/r/1100400 (https://phabricator.wikimedia.org/T381407) (owner: 10Brouberol) [09:29:52] (03CR) 10Cathal Mooney: New ferm rule to permit HDFS data flows and mark as low-prio for qos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [09:29:55] (03CR) 10CI reject: [V:04-1] New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [09:30:32] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [09:30:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:46] (03PS4) 10Cathal Mooney: New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) [09:31:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:31:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:32:52] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2442.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:33:00] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2443.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:33:11] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2444.mgmt.codfw.wmnet with 
chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:34:17] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [09:34:36] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-presto1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [09:35:07] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-presto1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [09:35:08] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:35:09] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-presto1001.eqiad.wmnet [09:35:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2023 to es5 master T381259', diff saved to https://phabricator.wikimedia.org/P71534 and previous config saved to /var/cache/conftool/dbconfig/20241204-093519-marostegui.json [09:35:23] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [09:35:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 to clone es2045', diff saved to https://phabricator.wikimedia.org/P71535 and previous config saved to /var/cache/conftool/dbconfig/20241204-093541-marostegui.json [09:35:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2024.codfw.wmnet with reason: cloning [09:36:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2024.codfw.wmnet with reason: cloning [09:38:18] (03PS1) 10Marostegui: mariadb: Productionize es2045 [puppet] - 10https://gerrit.wikimedia.org/r/1100404 
(https://phabricator.wikimedia.org/T381259) [09:39:01] !log brouberol@cumin2002 START - Cookbook sre.hosts.decommission for hosts an-presto1002.eqiad.wmnet [09:39:27] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2045 [puppet] - 10https://gerrit.wikimedia.org/r/1100404 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [09:42:09] (03CR) 10Filippo Giunchedi: "We'll need 443 access from internal network too, e.g. prometheus sends probes towards 443" [puppet] - 10https://gerrit.wikimedia.org/r/1100144 (owner: 10Muehlenhoff) [09:44:08] (03CR) 10Muehlenhoff: "These are covered fleet-wide via the generic full-monitoring-metrics-access rule" [puppet] - 10https://gerrit.wikimedia.org/r/1100144 (owner: 10Muehlenhoff) [09:45:45] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10378675 (10JMeybohm) [09:46:22] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2440.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:47:39] ACKNOWLEDGEMENT - MD RAID on mw2444 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T381472 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:47:44] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2444 - https://phabricator.wikimedia.org/T381472 (10ops-monitoring-bot) 03NEW [09:49:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [09:50:28] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2443.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:50:29] (03PS1) 10Jaime Nuche: bootstrap-scap-target.sh: handle multiple wheel versions [puppet] - 10https://gerrit.wikimedia.org/r/1100407 [09:50:31] (03CR) 10Filippo Giunchedi: [C:03+1] "Doh, of course! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1100144 (owner: 10Muehlenhoff) [09:50:34] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2444.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:52:09] (03PS1) 10Tiziano Fogli: thanos/compactor: increase downsampling/compation concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1100403 (https://phabricator.wikimedia.org/T381466) [09:52:09] (03CR) 10Tiziano Fogli: "The changes are ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1100403 (https://phabricator.wikimedia.org/T381466) (owner: 10Tiziano Fogli) [09:52:42] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:00] (03PS1) 10JMeybohm: Rename mw244[02-4] to wikikube-worker201[56],wikikube-worker217[12] [puppet] - 10https://gerrit.wikimedia.org/r/1100408 (https://phabricator.wikimedia.org/T377877) [09:55:55] (03PS2) 10Jaime Nuche: bootstrap-scap-target.sh: handle multiple wheel versions [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) [09:56:20] !log bump space for prometheus k8s-mlserve in eqiad [09:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:39] (03PS3) 10Jaime Nuche: bootstrap-scap-target.sh: handle multiple wheel versions [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) [09:57:30] 07sre-alert-triage, 06Data-Platform-SRE, 06DBA: Alert in need of triage: PrometheusMysqldExporterFailed (instance db1208:13351) - https://phabricator.wikimedia.org/T376978#10378704 (10Marostegui) I believe so yes, and also this host is part of Analytics. 
[09:58:30] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [09:59:21] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/1100403 (https://phabricator.wikimedia.org/T381466) (owner: 10Tiziano Fogli) [10:00:17] 07sre-alert-triage, 06DBA, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Alert in need of triage: PrometheusMysqldExporterFailed (instance db1208:13351) - https://phabricator.wikimedia.org/T376978#10378721 (10BTullis) a:03BTullis Apologies for the delay. I'll have a look at this. If I recall correctly, th... [10:02:49] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-presto1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [10:03:34] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-presto1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [10:03:34] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:03:35] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-presto1002.eqiad.wmnet [10:04:23] !log brouberol@cumin2002 START - Cookbook sre.hosts.decommission for hosts an-presto1003.eqiad.wmnet [10:04:30] 07sre-alert-triage, 06DBA, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Alert in need of triage: PrometheusMysqldExporterFailed (instance db1208:13351) - https://phabricator.wikimedia.org/T376978#10378749 (10BTullis) 05Open→03Resolved I had already masked the service and reset the failed unit, but... 
[10:04:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet [10:06:58] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:07:29] (03CR) 10Volans: [C:04-1] "It looks to me that it can be simplified quite a bit" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [10:09:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 8 hosts with reason: Rebooting [10:09:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 8 hosts with reason: Rebooting [10:10:13] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [10:13:42] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-presto1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [10:13:43] (03CR) 10Marostegui: mysql: add port number to MysqlClient (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [10:15:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:23] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10378785 (10Marostegui) [10:17:26] (03CR) 10Volans: [WIP, DNM] create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) 
[10:17:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10378790 (10Marostegui) Thank was fast! Thank you Jenn! [10:18:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1100408 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [10:18:56] (03CR) 10JMeybohm: [C:03+2] Rename mw244[02-4] to wikikube-worker201[56],wikikube-worker217[12] [puppet] - 10https://gerrit.wikimedia.org/r/1100408 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [10:19:12] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:19:12] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:19:43] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:19:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:19:55] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-presto1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [10:19:55] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:19:56] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-presto1003.eqiad.wmnet [10:20:49] (03PS6) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) [10:20:54] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2440 to wikikube-worker2015 
[10:20:54] (03CR) 10KCVelaga: Add Metrics Platform stream configuration for translate_extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [10:21:05] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:21:57] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:22:05] !log brouberol@cumin2002 START - Cookbook sre.hosts.decommission for hosts an-presto1004.eqiad.wmnet [10:22:18] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2442 to wikikube-worker20160 [10:22:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:22:30] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2442 to wikikube-worker20160 [10:22:40] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2443 to wikikube-worker2171 [10:22:53] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2444 to wikikube-worker2172 [10:23:10] !log jayme@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:23:20] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:23:47] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2442 to wikikube-worker2016 [10:23:59] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2442 to wikikube-worker2016 [10:25:15] (03PS5) 10Wangombe: Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) [10:25:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10378808 (10elukey) @Jclark-ctr I fixed the provisioning of ms-be1086, for some reasons if the BMC doesn't have IPv6 
enabled the settings that errored ou... [10:25:49] (03CR) 10Wangombe: Add Metrics Platform stream configuration for translate_extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [10:26:45] brouberol: merged your an-presto1004 netbox changes [10:27:11] (03CR) 10Vgutierrez: [C:03+2] hiera: Extend bwlimit to upload cluster globally [puppet] - 10https://gerrit.wikimedia.org/r/1100137 (owner: 10Vgutierrez) [10:27:28] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2442 to wikikube-worker2016 [10:27:31] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2444 to wikikube-worker2172 - jayme@cumin2002" [10:27:40] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2442 to wikikube-worker2016 [10:27:55] (03CR) 10KCVelaga: [C:03+1] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [10:28:23] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:28:24] !log enabling outbound bandwidth limits enforced by haproxy on the upload cluster [10:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:31] _joe_: ^^ [10:28:44] jayme: thanks! 
I have a decom cookbook running atm [10:29:03] ah, you currently hold the lock :D [10:29:07] brouberol: yeah, I saw that here - that's why I did not ask for confirmation :) [10:29:14] np [10:29:43] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2444 to wikikube-worker2172 - jayme@cumin2002" [10:29:43] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:29:44] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2172 [10:30:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2172 [10:30:46] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:30:47] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2171 [10:30:49] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2444 to wikikube-worker2172 [10:30:50] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [10:31:22] (03CR) 10Elukey: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [10:32:34] (03PS2) 10Tiziano Fogli: thanos/compactor: increase downsampling/compation concurrency [puppet] - 10https://gerrit.wikimedia.org/r/1100403 (https://phabricator.wikimedia.org/T381466) [10:33:12] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:33:13] !log removing ganeti2018 from active Ganeti nodes T376594 [10:33:13] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-presto1004.eqiad.wmnet [10:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:16] T376594: Add ganeti2035 to ganeti2044 and decom 
ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [10:33:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10378828 (10MoritzMuehlenhoff) [10:34:22] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:34:55] (03PS1) 10Hnowlan: mediawiki: various mercurius fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100412 (https://phabricator.wikimedia.org/T371701) [10:35:10] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2171 [10:35:23] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1100403 (https://phabricator.wikimedia.org/T381466) (owner: 10Tiziano Fogli) [10:35:44] (03PS1) 10Marostegui: installserver: Do not reimage es2041, es2042 [puppet] - 10https://gerrit.wikimedia.org/r/1100413 [10:35:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2443 to wikikube-worker2171 [10:36:02] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2172.codfw.wmnet with OS bookworm [10:36:10] PROBLEM - ganeti-noded running on ganeti2018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:36:10] PROBLEM - ganeti-confd running on ganeti2018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:36:13] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2172 [10:36:34] (03PS7) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) [10:36:43] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:44] !log jayme@cumin2002 START - 
Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2015 [10:36:53] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:36:55] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2171.codfw.wmnet with OS bookworm [10:37:06] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2171 [10:37:08] FIRING: ProbeDown: Service ganeti2018:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:37:19] (03PS8) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) [10:37:29] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2015 [10:37:50] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2442 to wikikube-worker2016 [10:38:06] (03PS1) 10Muehlenhoff: ganeti2018: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1100414 (https://phabricator.wikimedia.org/T376594) [10:38:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2440 to wikikube-worker2015 [10:38:25] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4632/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [10:38:40] (03CR) 10Clément Goubert: [C:03+1] mediawiki: various mercurius fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100412 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [10:38:49] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2015.codfw.wmnet with OS 
bookworm [10:39:27] !log brouberol@cumin2002 START - Cookbook sre.hosts.decommission for hosts an-presto1005.eqiad.wmnet [10:39:29] (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki: various mercurius fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100412 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [10:39:47] (03PS9) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) [10:40:13] (03CR) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [10:40:38] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2172 - jayme@cumin2002" [10:40:51] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4633/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [10:41:03] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2172 - jayme@cumin2002" [10:41:03] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:41:04] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2172.codfw.wmnet 77.48.192.10.in-addr.arpa 7.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:41:07] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2172.codfw.wmnet 77.48.192.10.in-addr.arpa 
7.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:41:08] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2172 [10:41:17] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2172 [10:41:18] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2172 [10:41:47] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:42:03] ACKNOWLEDGEMENT - MariaDB Replica SQL: s4 on db1245 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table wbc_entity_usage is corrupt: try to repair it on query. Default database: commonswiki. [Query snipped] Marostegui T381476 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:46:45] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2442 to wikikube-worker2016 - jayme@cumin2002" [10:46:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2442 to wikikube-worker2016 - jayme@cumin2002" [10:46:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:46:52] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2016 [10:46:53] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [10:47:07] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2016 [10:47:48] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2442 to wikikube-worker2016 [10:47:58] 10ops-codfw, 06DC-Ops, 06serviceops: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381478 (10JMeybohm) 03NEW 
[10:48:31] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2016.codfw.wmnet with OS bookworm [10:49:23] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:49:38] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-presto1005.eqiad.wmnet [10:49:38] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2016.codfw.wmnet with OS bookworm [10:50:05] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:52:29] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:52:29] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2171.codfw.wmnet 152.32.192.10.in-addr.arpa 2.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:52:32] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2171.codfw.wmnet 152.32.192.10.in-addr.arpa 2.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:52:33] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2171 [10:52:42] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2171 [10:52:42] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2171 [10:53:20] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2015 [10:53:27] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:54:07] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2016.codfw.wmnet with OS bookworm [10:55:29] (03PS1) 10Gmodena: EventStreamConfig: add content_history streams. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100417 (https://phabricator.wikimedia.org/T381322) [10:55:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:00] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2015 - jayme@cumin2002" [10:57:01] (03CR) 10Filippo Giunchedi: "There's a bunch of things to unpack I think, and I may be missing some context so please bear with me!" [puppet] - 10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:57:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2015 - jayme@cumin2002" [10:57:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:57:06] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2015.codfw.wmnet 149.32.192.10.in-addr.arpa 9.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:57:10] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2015.codfw.wmnet 149.32.192.10.in-addr.arpa 9.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:57:11] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2015 [10:57:17] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2015.codfw.wmnet wikikube-worker2016.codfw.wmnet wikikube-worker2171.codfw.wmnet wikikube-worker2172.codfw.wmnet on all recursors [10:57:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2015.codfw.wmnet wikikube-worker2016.codfw.wmnet 
wikikube-worker2171.codfw.wmnet wikikube-worker2172.codfw.wmnet on all recursors [10:57:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2015 [10:57:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2015 [10:58:05] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2016 [10:58:31] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1100) [11:03:32] !log restarting haproxy on cp1107 [11:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:13] (03PS7) 10Aklapper: Redirect svn.wikimedia.org/doc properly [puppet] - 10https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: 10Dereckson) [11:07:19] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2016 - jayme@cumin2002" [11:07:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2016 - jayme@cumin2002" [11:07:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:25] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2016.codfw.wmnet 151.32.192.10.in-addr.arpa 1.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:07:28] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2016.codfw.wmnet 151.32.192.10.in-addr.arpa 1.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:07:29] !log jayme@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2016 [11:07:39] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2016 [11:07:39] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2016 [11:07:44] (03PS1) 10Vgutierrez: Revert "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100419 [11:08:13] (03CR) 10Aklapper: "Attempted to rebase/amend. Also removed the generated file `modules/mediawiki/files/apache/sites/redirects.conf` from being included in th" [puppet] - 10https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: 10Dereckson) [11:10:39] (03PS2) 10Vgutierrez: Revert "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100419 [11:11:33] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100419 (owner: 10Vgutierrez) [11:11:48] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2171.codfw.wmnet with reason: host reimage [11:13:17] !log disabling outbound bandwidth limits enforced by haproxy on the upload cluster (we are getting haproxy crashes) [11:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:35] so convenient I'm the one on-call lol [11:14:35] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2171.codfw.wmnet with reason: host reimage [11:14:54] (03CR) 10Phuedx: [C:03+1] "The configuration LGTM and will work. I can't speak to the values that you're collecting for this stream though." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [11:15:08] (03CR) 10Ilias Sarantopoulos: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [11:15:11] (03PS2) 10Marostegui: installserver: Do not reimage es2041, es2042 [puppet] - 10https://gerrit.wikimedia.org/r/1100413 [11:15:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 226, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:13] RESOLVED: ProbeDown: Service ganeti2018:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:01] (03PS1) 10Gmodena: dse-k8s: rename mw-dumps helmfiles. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [11:24:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:24:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:24:40] (03PS2) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [11:25:31] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 310, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:26:30] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2016.codfw.wmnet with reason: host reimage [11:26:37] (03PS3) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [11:30:19] (03CR) 10Gmodena: "Some prep work to support the release of Dumps 2." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [11:32:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2016.codfw.wmnet with reason: host reimage [11:32:26] (03CR) 10Hnowlan: [C:03+2] mediawiki: various mercurius fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100412 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:34:00] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compactor: increase downsampling/compation concurrency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100403 (https://phabricator.wikimedia.org/T381466) (owner: 10Tiziano Fogli) [11:34:13] (03Merged) 10jenkins-bot: mediawiki: various mercurius fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100412 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:34:22] (03PS1) 10Vgutierrez: Revert^2 "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100423 [11:34:50] (03PS2) 10Vgutierrez: Revert^2 "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100423 [11:35:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100423 (owner: 10Vgutierrez) [11:35:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:35:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:36:46] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2171.codfw.wmnet with OS bookworm [11:38:18] !log jayme@cumin2002 END (FAIL) 
- Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2172.codfw.wmnet with OS bookworm [11:39:02] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2172.codfw.wmnet with OS bookworm [11:40:53] (03PS3) 10Vgutierrez: Revert^2 "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100423 [11:41:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100423 (owner: 10Vgutierrez) [11:41:34] (03CR) 10Muehlenhoff: [C:03+2] ganeti2018: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1100414 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [11:42:30] (03CR) 10Stevemunene: [C:03+2] datahub: add datahub production index prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) (owner: 10Stevemunene) [11:43:35] (03Merged) 10jenkins-bot: datahub: add datahub production index prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1097372 (https://phabricator.wikimedia.org/T377814) (owner: 10Stevemunene) [11:45:58] (03CR) 10Muehlenhoff: [C:03+2] graphite: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100144 (owner: 10Muehlenhoff) [11:46:17] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage es2041, es2042 [puppet] - 10https://gerrit.wikimedia.org/r/1100413 (owner: 10Marostegui) [11:47:57] (03PS1) 10Dreamy Jazz: Create a DB list for wikis with continuous MediaModeration scans [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100426 (https://phabricator.wikimedia.org/T355169) [11:48:20] (03CR) 10Fabfur: [C:03+1] "Looks that `filter` directive is now present on all hosts, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100423 (owner: 10Vgutierrez) [11:48:22] jouncebot: nowandnext [11:48:22] For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1100) [11:48:22] In 0 hour(s) and 11 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1200) [11:48:51] (03CR) 10Vgutierrez: [C:03+2] Revert^2 "hiera: Extend bwlimit to upload cluster globally" [puppet] - 10https://gerrit.wikimedia.org/r/1100423 (owner: 10Vgutierrez) [11:49:06] (03PS10) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) [11:49:30] (03CR) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [11:49:39] !log re-enabling outbound bandwidth limits enforced by haproxy on the upload cluster [11:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:18] PROBLEM - MariaDB Replica Lag: s1 on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:26] (03PS1) 10Dreamy Jazz: [WIP] Update MediaModeration module to run scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) [11:50:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [11:50:56] (03PS2) 10Dreamy Jazz: Create a DB list for wikis with continuous MediaModeration scans [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100426 (https://phabricator.wikimedia.org/T355169) [11:51:05] 
(03CR) 10CI reject: [V:04-1] [WIP] Update MediaModeration module to run scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [11:51:09] (03PS2) 10Dreamy Jazz: [WIP] Update MediaModeration module to run scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) [11:51:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100426 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [11:52:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2016.codfw.wmnet with OS bookworm [11:52:36] (03Merged) 10jenkins-bot: Create a DB list for wikis with continuous MediaModeration scans [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100426 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [11:53:04] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1100426|Create a DB list for wikis with continuous MediaModeration scans (T355169)]] [11:53:06] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - https://phabricator.wikimedia.org/T355169 [11:56:30] (03PS1) 10Dreamy Jazz: Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) [11:58:02] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2172.codfw.wmnet with reason: host reimage [11:59:12] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1100426|Create a DB list for wikis with continuous MediaModeration scans (T355169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:59:14] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - 
https://phabricator.wikimedia.org/T355169 [11:59:23] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [12:00:04] mvolz: Your horoscope predicts another Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1200). [12:01:05] (03PS1) 10Stevemunene: datahub: Rebuild datahub for java updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100433 (https://phabricator.wikimedia.org/T377938) [12:01:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:01:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:02:18] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2172.codfw.wmnet with reason: host reimage [12:03:25] (03CR) 10Dreamy Jazz: "Want to backport this so that the fix is ready for when puppet runs the scripts automatically." 
[extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:03:36] (03PS1) 10Dreamy Jazz: Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100434 (https://phabricator.wikimedia.org/T355169) [12:04:47] (03CR) 10CI reject: [V:04-1] Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:05:15] (03CR) 10Dreamy Jazz: "recheck" [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:05:26] jouncebot: nowandnext [12:05:26] For the next 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1200) [12:05:26] In 1 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400) [12:05:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:06:06] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100426|Create a DB list for wikis with continuous MediaModeration scans (T355169)]] (duration: 13m 02s) [12:06:09] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - https://phabricator.wikimedia.org/T355169 [12:06:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/MediaModeration] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100434 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:09:21] (03CR) 10Elukey: [C:03+1] ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [12:12:27] (03CR) 10Ilias Sarantopoulos: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [12:13:50] (03PS11) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) [12:14:16] (03CR) 10Klausman: ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [12:17:07] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4634/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [12:17:57] (03CR) 10Dreamy Jazz: "recheck" [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:18:13] (03CR) 10Klausman: [V:03+1 C:03+2] ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc [puppet] - 10https://gerrit.wikimedia.org/r/1100056 (https://phabricator.wikimedia.org/T371344) (owner: 10Klausman) [12:22:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2172.codfw.wmnet with OS bookworm 
[12:22:33] (03CR) 10Dreamy Jazz: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [12:22:39] (03CR) 10Dreamy Jazz: [C:03+1] Ensure IP reveal buttons are not shown on Special:MassGlobalBlock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [12:22:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [12:23:13] (03CR) 10Dreamy Jazz: [C:03+1] "I can deploy this in the upcoming window where I have other changes to deploy too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [12:25:54] (03CR) 10Brouberol: [C:03+1] datahub: Rebuild datahub for java updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100433 (https://phabricator.wikimedia.org/T377938) (owner: 10Stevemunene) [12:26:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:31:17] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1001.eqiad.wmnet - https://phabricator.wikimedia.org/T381487#10379330 (10brouberol) [12:31:18] (03CR) 10Btullis: [C:03+1] datahub: Rebuild datahub for java updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100433 (https://phabricator.wikimedia.org/T377938) (owner: 10Stevemunene) [12:31:21] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381488 (10brouberol) 03NEW [12:31:48] 10ops-eqiad, 06DC-Ops, 
10decommission-hardware: decommission an-presto1003.eqiad.wmnet - https://phabricator.wikimedia.org/T381489 (10brouberol) 03NEW [12:32:13] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1004.eqiad.wmnet - https://phabricator.wikimedia.org/T381490 (10brouberol) 03NEW [12:32:32] !log installing glib2.0 security updates [12:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:46] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1005.eqiad.wmnet - https://phabricator.wikimedia.org/T381491 (10brouberol) 03NEW [12:32:53] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1001.eqiad.wmnet - https://phabricator.wikimedia.org/T381487#10379389 (10brouberol) [12:33:12] (03CR) 10Stevemunene: [C:03+2] datahub: Rebuild datahub for java updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100433 (https://phabricator.wikimedia.org/T377938) (owner: 10Stevemunene) [12:33:59] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:34:26] (03Merged) 10jenkins-bot: datahub: Rebuild datahub for java updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100433 (https://phabricator.wikimedia.org/T377938) (owner: 10Stevemunene) [12:35:05] (03PS1) 10Dreamy Jazz: Stats: Move StatsFactory flush into emitBufferedStats [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100442 (https://phabricator.wikimedia.org/T380609) [12:35:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100442 (https://phabricator.wikimedia.org/T380609) (owner: 10Dreamy Jazz) [12:35:44] jouncebot: nowandnext [12:35:44] For the next 0 hour(s) and 24 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1200) [12:35:44] In 1 
hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400) [12:36:03] Going to start with gate-and-submit-wmf for some of the backports, as they will take a time to complete [12:36:20] (03CR) 10Dreamy Jazz: [C:03+2] Stats: Move StatsFactory flush into emitBufferedStats [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100442 (https://phabricator.wikimedia.org/T380609) (owner: 10Dreamy Jazz) [12:36:35] (03CR) 10Dreamy Jazz: [C:03+2] Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100434 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:36:39] (03CR) 10Dreamy Jazz: [C:03+2] Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:36:57] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1001.eqiad.wmnet - https://phabricator.wikimedia.org/T381487#10379411 (10brouberol) [12:37:08] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381488#10379413 (10brouberol) [12:37:19] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1003.eqiad.wmnet - https://phabricator.wikimedia.org/T381489#10379419 (10brouberol) [12:37:22] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1004.eqiad.wmnet - https://phabricator.wikimedia.org/T381490#10379421 (10brouberol) [12:37:27] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-presto1005.eqiad.wmnet - https://phabricator.wikimedia.org/T381491#10379423 (10brouberol) [12:38:17] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [12:40:06] 
!log imported debs for mercurius_1.0.2 [12:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:40] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [12:44:28] jouncebot: nowandnext [12:44:29] For the next 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1200) [12:44:29] In 1 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400) [12:47:08] !log uploaded mailman3 3.3.8-2~deb12u2+wmf1 T377045 [12:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:11] T377045: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045 [12:47:18] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [12:49:12] (03PS4) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [12:49:28] (03PS1) 10Mvolz: Enable wayback in config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100443 (https://phabricator.wikimedia.org/T369084) [12:50:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:11] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [12:52:54] (03CR) 10Mvolz: [C:03+2] Enable wayback in config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100443 (https://phabricator.wikimedia.org/T369084) (owner: 10Mvolz) [12:53:21] RECOVERY - MariaDB Replica Lag: s1 on db1206 is OK: OK slave_sql_lag Replication lag: 4.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:53:59] (03Merged) 10jenkins-bot: Enable wayback in config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100443 (https://phabricator.wikimedia.org/T369084) (owner: 10Mvolz) [12:54:44] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1004,1009].eqiad.wmnet with reason: Hardware refresh [12:54:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1004,1009].eqiad.wmnet with reason: Hardware refresh [12:55:31] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:55:35] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:56:14] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100444 (https://phabricator.wikimedia.org/T219903) [12:56:33] !log mvolz@deploy2002 helmfile [staging] START 
helmfile.d/services/citoid: apply [12:57:17] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:57:20] (03Merged) 10jenkins-bot: Stats: Move StatsFactory flush into emitBufferedStats [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100442 (https://phabricator.wikimedia.org/T380609) (owner: 10Dreamy Jazz) [12:57:23] (03Merged) 10jenkins-bot: Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1100434 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:57:25] (03Merged) 10jenkins-bot: Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php [extensions/MediaModeration] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100430 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [12:58:26] (03PS3) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [12:58:41] I'm going to start scap for these wmf backports as they have merged and it is nearly the time for the window [12:58:43] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10379487 (10MoritzMuehlenhoff) >>! In T377045#10377117, @Dzahn wrote: > Since we could not test if the service starts on list2001 (fails because... 
[12:59:01] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100444 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [12:59:12] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]] [12:59:16] T380609: Maintenance scripts do not emit StatsLib metrics - https://phabricator.wikimedia.org/T380609 [12:59:17] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - https://phabricator.wikimedia.org/T355169 [12:59:55] Scap failed with: [13:00:05] mergeMessageFileList.php generated PHP notices/warnings: Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.6/includes/debug/logger/monolog/LegacyHandler.php on line 234 [13:00:32] (03CR) 10DDesouza: [V:03+2 C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100444 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [13:00:34] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100444 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [13:00:35] aren't logs sent over UDP? ??? [13:00:37] Going to re-try scap [13:00:51] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]] [13:01:11] Failed again. 
[13:01:44] or well the patch you are deploying breaks the world [13:02:00] Maybe [13:02:11] I doubt it would be the MediaModeration patches [13:02:22] Perhaps the core patch is suspect [13:04:18] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:04:25] who knows [13:04:35] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:04:36] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:04:50] but that logger code is a socket_sendto() harnessed behind a $this->useUdp() [13:04:56] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:04:57] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:05:02] why it cant send to udp .. I have no clue [13:05:11] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:05:23] (03PS1) 10Effie Mouzeli: kafka-main: Replace kafka-main1004 with kafka-main1009 [puppet] - 10https://gerrit.wikimedia.org/r/1100447 (https://phabricator.wikimedia.org/T363214) [13:05:38] (03PS1) 10Jelto: Rename kubernetes1023 and kubernetes1024 [puppet] - 10https://gerrit.wikimedia.org/r/1100448 (https://phabricator.wikimedia.org/T377876) [13:05:50] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:05:53] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:05:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:05:55] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:05:58] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:05:59] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:06:01] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:06:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:06:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T371742)', diff saved to https://phabricator.wikimedia.org/P71537 and previous config saved to /var/cache/conftool/dbconfig/20241204-130614-ladsgroup.json [13:06:18] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:06:23] Given that I successfully backported a config change less than an hour ago, I'll try undoing the core change. [13:06:33] To see if it was the core. [13:06:36] *core patc [13:06:39] *patch [13:07:03] (03PS1) 10Dreamy Jazz: Revert "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100449 [13:07:21] (03PS2) 10Dreamy Jazz: Revert "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100449 [13:07:28] (03CR) 10Dreamy Jazz: [C:03+2] Revert "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100449 (owner: 10Dreamy Jazz) [13:08:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100449 (owner: 10Dreamy Jazz) [13:09:40] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:09:43] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:09:44] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:09:47] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:09:49] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:09:51] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:11:02] (03PS18) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) [13:12:46] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:12:49] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:12:51] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:12:54] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:12:55] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:12:57] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:14:17] those helmfile `!log` should be cut somehow [13:14:22] that is rather spammy [13:14:55] hashar: and meaningless. i have no idea what was deployed there. [13:15:09] miscweb across all 3 namespaces [13:15:41] i meant what changed inside miscweb [13:17:40] the helm diff should tell you what you are changing/deploying. Miscweb is in one namespace + query service in another namespace. [13:18:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1154.eqiad.wmnet with reason: Alter table [13:18:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1154.eqiad.wmnet with reason: Alter table [13:19:06] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10379577 (10BBlack) Seems like a net win to me. Reduces some error-prone process stuff and makes life simpler! [13:19:32] jelto: i know, but that is only shown when i'm the deployer, right? not when i'm looking at SAL and trying to figure out what changed from the logs. 
[13:19:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Alter table [13:19:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Alter table [13:19:45] or is the diff stored somewhere for later review? [13:19:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Alter table [13:19:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Alter table [13:20:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Alter table [13:20:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Alter table [13:20:39] that's right, the diff is only visible for the deployer. But the diff can contain secrets and credentials so it's not public [13:20:48] urbanecm: well you gotta check the deployment-charts repo, find out some image version got bumped, from there head to the repo defining it. You can do a diff of the image yes [13:21:30] (03PS1) 10Slyngshede: Password reset: use passlib for hashing [software/bitu] - 10https://gerrit.wikimedia.org/r/1100451 [13:22:13] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10379599 (10BBlack) Also, probably the way to standardize this for sanity (avoiding ORIGIN mistakes on both ends) is to follow some simple rules that: 1. Every one of the new includ... 
[13:23:17] (03PS1) 10Effie Mouzeli: Update various kafka-main connection strings for kafka-main1009 Replacing kafka-main1004 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100452 (https://phabricator.wikimedia.org/T363214) [13:27:45] (03Merged) 10jenkins-bot: Revert "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1100449 (owner: 10Dreamy Jazz) [13:28:15] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100449|Revert "Stats: Move StatsFactory flush into emitBufferedSta [13:28:15] ts"]] [13:28:19] T380609: Maintenance scripts do not emit StatsLib metrics - https://phabricator.wikimedia.org/T380609 [13:28:19] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - https://phabricator.wikimedia.org/T355169 [13:29:01] Scap is working with the core backport reverted. 
[13:30:15] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes1023 and kubernetes1024 [puppet] - 10https://gerrit.wikimedia.org/r/1100448 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [13:30:46] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1023-1024].eqiad.wmnet [13:31:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1023-1024].eqiad.wmnet [13:32:57] (03CR) 10Jelto: [C:03+2] Rename kubernetes1023 and kubernetes1024 [puppet] - 10https://gerrit.wikimedia.org/r/1100448 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [13:33:49] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100449|Revert "Stats: Move StatsFactory flush into emitBufferedStats"]] synced [13:33:49] to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:33:57] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:33:58] T380609: Maintenance scripts do not emit StatsLib metrics - https://phabricator.wikimedia.org/T380609 [13:33:58] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - https://phabricator.wikimedia.org/T355169 [13:35:17] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2015.codfw.wmnet with OS bookworm [13:35:46] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2015.codfw.wmnet with OS bookworm [13:38:14] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10379654 (10cmooney) [13:39:05] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1023 to 
wikikube-worker1036 [13:39:10] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10379658 (10cmooney) [13:39:24] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:39:42] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10379659 (10cmooney) >>! In T362985#10379599, @BBlack wrote: > Also, probably the way to standardize this for sanity (avoiding ORIGIN mistakes on both ends) is to follow some simple... [13:39:53] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:53] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:14] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2016,2171-2172].codfw.wmnet [13:41:16] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2016,2171-2172].codfw.wmnet [13:42:13] PROBLEM - BGP status on lsw1-c5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:42:53] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 
'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100449|Revert "Stats: Move StatsFactory flush into emitBufferedSt [13:42:53] ats"]] (duration: 14m 38s) [13:42:57] T380609: Maintenance scripts do not emit StatsLib metrics - https://phabricator.wikimedia.org/T380609 [13:42:57] T355169: Run scanFilesInScanTable.php automatically on WMF wikis - https://phabricator.wikimedia.org/T355169 [13:43:10] abijeet: You here for the window? [13:43:11] (03CR) 10Filippo Giunchedi: alerts: enable paging mariadb through prometheus (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1100042 (https://phabricator.wikimedia.org/T381276) (owner: 10Arnaudb) [13:43:57] I'm going to not deploy the core backport, as it appears to be broken on production. [13:45:21] (03CR) 10Dreamy Jazz: [C:03+2] Ensure IP reveal buttons are not shown on Special:MassGlobalBlock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [13:45:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:05] (03Merged) 10jenkins-bot: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [13:47:01] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1100150|Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (T124607)]] [13:47:08] T124607: Create a special page for mass global (un)block - https://phabricator.wikimedia.org/T124607 [13:47:24] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1023 to wikikube-worker1036 
- jelto@cumin1002" [13:47:34] jouncebot: nowandnext [13:47:34] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [13:47:34] In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400) [13:47:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1023 to wikikube-worker1036 - jelto@cumin1002" [13:47:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1036 [13:47:56] Not the window yet. Got myself confused as to when it started. [13:48:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1036 [13:48:59] (03CR) 10Xcollazo: [C:03+1] "Cursory look from my side." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [13:49:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1023 to wikikube-worker1036 [13:49:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:50:05] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1024 to wikikube-worker1037 [13:50:09] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2445-2447].codfw.wmnet [13:50:25] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:51:56] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2445-2447].codfw.wmnet [13:52:46] (03PS2) 
10Slyngshede: Password reset: use passlib for hashing [software/bitu] - 10https://gerrit.wikimedia.org/r/1100451 (https://phabricator.wikimedia.org/T381327) [13:53:10] !log dreamyjazz@deploy2002 tchanders, dreamyjazz: Backport for [[gerrit:1100150|Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (T124607)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:53:14] T124607: Create a special page for mass global (un)block - https://phabricator.wikimedia.org/T124607 [13:53:34] !log dreamyjazz@deploy2002 tchanders, dreamyjazz: Continuing with sync [13:54:14] (03PS4) 10Alexandros Kosiaris: gateway-check: Make indentation consistent [puppet] - 10https://gerrit.wikimedia.org/r/1100111 [13:54:14] (03PS14) 10Alexandros Kosiaris: gateway-check: Support (and use) per wiki rules [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) [13:54:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1024 to wikikube-worker1037 - jelto@cumin1002" [13:54:18] (03CR) 10Alexandros Kosiaris: gateway-check: Support (and use) per wiki rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [13:54:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1100451 (https://phabricator.wikimedia.org/T381327) (owner: 10Slyngshede) [13:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:54:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming kubernetes1024 to wikikube-worker1037 - jelto@cumin1002" [13:54:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1037 [13:54:55] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2015.codfw.wmnet with reason: host reimage [13:55:43] (03PS1) 10Muehlenhoff: maps: Allow disabling the installation of kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1100456 [13:55:43] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1037 [13:55:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:59] (03CR) 10Slyngshede: [C:03+2] Password reset: use passlib for hashing [software/bitu] - 10https://gerrit.wikimedia.org/r/1100451 (https://phabricator.wikimedia.org/T381327) (owner: 10Slyngshede) [13:56:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1024 to wikikube-worker1037 [13:56:31] PROBLEM - Host mw2445 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:17] this is me [13:57:59] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on mw[2445-2447].codfw.wmnet with reason: reimage [13:58:13] (03CR) 10Alexandros Kosiaris: [C:04-1] kafka-main: Replace kafka-main1004 with kafka-main1009 (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100447 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [13:58:18] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw[2445-2447].codfw.wmnet with reason: reimage [13:58:24] RECOVERY - Host mw2445 is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [13:58:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2015.codfw.wmnet with reason: host reimage [13:58:29] (03Merged) 10jenkins-bot: Password reset: use passlib for hashing [software/bitu] - 10https://gerrit.wikimedia.org/r/1100451 (https://phabricator.wikimedia.org/T381327) (owner: 10Slyngshede) [13:59:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10379753 (10Jclark-ctr) [13:59:17] (03PS1) 10Giuseppe Lavagetto: profile::mariadb::core: alert on all replicas [puppet] - 10https://gerrit.wikimedia.org/r/1100457 (https://phabricator.wikimedia.org/T381276) [13:59:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10379756 (10Jclark-ctr) [14:00:10] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100150|Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (T124607)]] (duration: 13m 08s) [14:00:14] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400). [14:00:15] abijeet and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:16] T124607: Create a special page for mass global (un)block - https://phabricator.wikimedia.org/T124607 [14:00:22] \o [14:00:35] My backporting is now done [14:00:38] o/ [14:00:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [14:00:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1044.eqiad.wmnet with OS bookworm [14:00:56] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1045.eqiad.wmnet with OS bookworm [14:00:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10379763 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm [14:01:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10379764 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm [14:01:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10379765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm [14:01:09] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2445.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:01:18] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2447.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:01:24] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2446.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with 
Dell SCP reboot policy GRACEFUL [14:01:37] I want to go for lunch. Would another deployer be able to deploy the remaining change? [14:01:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100456 (owner: 10Muehlenhoff) [14:01:50] (03CR) 10Marostegui: [C:03+1] profile::mariadb::core: alert on all replicas [puppet] - 10https://gerrit.wikimedia.org/r/1100457 (https://phabricator.wikimedia.org/T381276) (owner: 10Giuseppe Lavagetto) [14:01:54] (03PS2) 10Effie Mouzeli: kafka-main: Replace kafka-main1004 with kafka-main1009 [puppet] - 10https://gerrit.wikimedia.org/r/1100447 (https://phabricator.wikimedia.org/T363214) [14:02:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:02:54] o/ [14:02:57] I think I can deploy in a few minutes [14:04:07] Cool, thanks! [14:04:44] Just throwing it out there, I didn't get the sticker for breaking legalteam wiki [14:04:55] I helped fix it too [14:05:09] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1036.eqiad.wmnet wikikube-worker1037.eqiad.wmnet on all recursors [14:05:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1036.eqiad.wmnet wikikube-worker1037.eqiad.wmnet on all recursors [14:05:37] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks both!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:05:45] (03CR) 10Alexandros Kosiaris: [C:03+2] gateway-check: Make indentation consistent [puppet] - 10https://gerrit.wikimedia.org/r/1100111 (owner: 10Alexandros Kosiaris) [14:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:06:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:07:26] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1036.eqiad.wmnet with OS bookworm [14:07:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:07:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:07:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 226, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:52] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 310, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:25] (03CR) 10CI reject: [V:04-1] Translate: Enable message group subscription for 6 wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:08:40] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:23] (03CR) 10Abijeet Patro: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:09:51] o_O [14:10:00] oh. just another DNS failure [14:10:09] (03CR) 10Lucas Werkmeister (WMDE): "probably due to T374830" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:10:11] (03CR) 10Alexandros Kosiaris: [C:03+1] kafka-main: Replace kafka-main1004 with kafka-main1009 [puppet] - 10https://gerrit.wikimedia.org/r/1100447 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [14:10:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:10:18] took me a second to see it [14:10:53] (03Merged) 10jenkins-bot: Translate: Enable message group subscription for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100352 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:11:20] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1100352|Translate: Enable message group subscription for 6 wikis (T372386)]] [14:11:24] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:13:47] (03CR) 10Alexandros Kosiaris: gateway-check: Support (and use) per wiki rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [14:17:22] !log 
lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1100352|Translate: Enable message group subscription for 6 wikis (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:25] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:17:29] abijeet: please test :) [14:18:08] Lucas_WMDE, on it [14:18:20] RECOVERY - BGP status on lsw1-c5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:27] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2015.codfw.wmnet with OS bookworm [14:19:05] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [14:20:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10379859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [14:20:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:20:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53070 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:20:42] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:08] Lucas_WMDE, Looks OK on my end. Please proceed. [14:22:56] ok thanks! 
[14:22:58] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Continuing with sync [14:23:36] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1036.eqiad.wmnet with reason: host reimage [14:23:46] (03CR) 10Xcollazo: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100417 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [14:27:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1036.eqiad.wmnet with reason: host reimage [14:28:00] Dreamy_Jazz: is there a task for the scap problem where its internal maintenance scripts fail due to network being blocked / due to MW now using the network from a script that was previously presumed to be offline (= enabling statslib in maint = the patch). [14:28:06] (03PS1) 10Brouberol: mw-dump-rev-content-reconcile-enrich: rename namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) [14:28:15] This will break scap again next week if not fixed before then [14:28:33] I tagged the associated task as a train blocker for the next train [14:28:41] I haven't filed a separate task for it. [14:28:54] Found it, thanks! [14:29:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100352|Translate: Enable message group subscription for 6 wikis (T372386)]] (duration: 18m 12s) [14:29:35] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:30:08] anything else to deploy? [14:30:27] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T381464#10379896 (10andrea.denisse) a:03andrea.denisse [14:31:12] Lucas_WMDE, thank you.
[14:31:18] (03PS3) 10Andrew Bogott: Remove ceph references to cloudcephosd100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1098095 (https://phabricator.wikimedia.org/T380893) [14:31:18] (03PS2) 10Andrew Bogott: Remove refs to cloudcephmon100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1098096 (https://phabricator.wikimedia.org/T380893) [14:31:18] yw :) [14:31:23] !log UTC afternoon backport+config window done [14:31:24] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098095 (https://phabricator.wikimedia.org/T380893) (owner: 10Andrew Bogott) [14:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:36] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10379903 (10ssingh) 05Open→03Resolved a:03ssingh [14:32:52] (03PS5) 10Dreamy Jazz: Update MediaModeration module to run scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) [14:34:24] Dreamy_Jazz: do you have access to a stack trace from that mergeMessageFileList.php warning? [14:34:36] There was no stack trace printed to the console. [14:34:49] Is there somewhere else it could have been printed to? [14:35:06] I see, yeah, there wouldn't be if it's plain php-cli stderr. MediaWiki obtains a trace when reporting them to Logstash. [14:35:28] But given it's run in an enforced-offline context while building the docker image, that would have been lost and/or disabled by configuration. 
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:27] (03CR) 10Gmodena: mw-dump-rev-content-reconcile-enrich: rename namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [14:40:01] (03PS6) 10Fabfur: cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [14:40:16] (03CR) 10Brouberol: mw-dump-rev-content-reconcile-enrich: rename namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [14:40:41] (03CR) 10Andrew Bogott: [C:03+2] Remove ceph references to cloudcephosd100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1098095 (https://phabricator.wikimedia.org/T380893) (owner: 10Andrew Bogott) [14:40:55] (03CR) 10Muehlenhoff: "Good catch! 
Some questions and comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [14:40:58] (03PS1) 10Hnowlan: php8.1: rebuild to pick up mercurius images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100462 (https://phabricator.wikimedia.org/T371701) [14:42:56] (03CR) 10Alexandros Kosiaris: [C:03+1] php8.1: rebuild to pick up mercurius images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100462 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:43:16] (03CR) 10Kamila Součková: [C:03+1] php8.1: rebuild to pick up mercurius images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100462 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:44:03] (03CR) 10Hnowlan: [C:03+1] Update various kafka-main connection strings for kafka-main1009 Replacing kafka-main1004 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100452 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [14:44:31] jouncebot: nowandnext [14:44:31] For the next 0 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400) [14:44:31] In 0 hour(s) and 15 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1500) [14:44:36] (03CR) 10Hnowlan: [V:03+2 C:03+2] php8.1: rebuild to pick up mercurius images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1100462 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:45:25] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [14:46:06] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [14:46:26] hihi (cc Lucas_WMDE) — I intend 
to deploy 1090502, any issues? [14:46:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1036.eqiad.wmnet with OS bookworm [14:46:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T371742)', diff saved to https://phabricator.wikimedia.org/P71540 and previous config saved to /var/cache/conftool/dbconfig/20241204-144651-ladsgroup.json [14:46:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:46:55] (03CR) 10Brouberol: mw-dump-rev-content-reconcile-enrich: rename namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [14:47:00] (03CR) 10Brouberol: dse-k8s-services: rename mw-dumps helmfiles. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [14:47:17] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1037.eqiad.wmnet with OS bookworm [14:47:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [14:48:12] (03Merged) 10jenkins-bot: Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [14:48:45] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1090502|Add new namespaces to hsb wiktionary (T373634)]] [14:48:48] T373634: Add new namespaces to hsb.wiktionary.org - https://phabricator.wikimedia.org/T373634 [14:49:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [14:50:11] !log jayme@cumin2002 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host mw2445.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:50:17] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2447.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:50:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2446.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:51:25] !log samtar@deploy2002 samtar, srishakatux: Backport for [[gerrit:1090502|Add new namespaces to hsb wiktionary (T373634)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:51:34] (03CR) 10Muehlenhoff: php: Allow provisioning MediaWiki with PHP 8.1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [14:51:46] TheresNoTime: I saw it late but feel free to go ahead :) [14:52:20] !log samtar@deploy2002 samtar, srishakatux: Continuing with sync [14:52:33] (also now I’m imagining us sending around “intend to deploy” announcement emails like browsers do for “intend to ship” lol) [14:53:35] (03CR) 10Alexandros Kosiaris: [C:03+1] "LGTM, but this will need a manual rebase." 
[puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [14:54:57] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2015.codfw.wmnet [14:54:59] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2015.codfw.wmnet [14:56:28] jouncebot: nowandnext [14:56:28] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1400) [14:56:28] In 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1500) [14:56:53] (03PS1) 10Bking: cumin: add aliases for net-new wdqs services [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) [14:57:02] * TheresNoTime is almost done deploying [14:59:01] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090502|Add new namespaces to hsb wiktionary (T373634)]] (duration: 10m 16s) [14:59:04] T373634: Add new namespaces to hsb.wiktionary.org - https://phabricator.wikimedia.org/T373634 [15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1500) [15:00:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:00:59] (03PS7) 10Fabfur: cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) [15:01:20] !log '[samtar@deploy2002 ~]$ mwscript-k8s --comment="T373634" -f -- namespaceDupes.php --wiki hsbwiktionary --fix' for T373634 [15:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:47] * TheresNoTime done [15:01:55] RECOVERY - MariaDB Replica SQL: s4 on db1245 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:01:58] custom shell prompt spotted 👀 [15:01:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P71541 and previous config saved to /var/cache/conftool/dbconfig/20241204-150157-ladsgroup.json [15:02:12] (03CR) 10Brouberol: [C:03+1] cumin: add aliases for net-new wdqs services [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:02:50] (03CR) 10Bking: [C:03+2] cumin: add aliases for net-new wdqs services [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:03:14] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1037.eqiad.wmnet with reason: host reimage [15:03:47] (03PS1) 10JMeybohm: Rename mw244[5-7] to wikikube-worker217[3-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100467 (https://phabricator.wikimedia.org/T377877) [15:06:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1037.eqiad.wmnet with reason: host reimage [15:06:45] (03CR) 10Vgutierrez: [C:03+1] cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:06:51] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [15:07:30] (03CR) 10Jelto: [C:03+1] 
"lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1100467 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [15:08:23] (03CR) 10JMeybohm: [C:03+2] Rename mw244[5-7] to wikikube-worker217[3-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100467 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [15:08:59] (03CR) 10Fabfur: [C:03+2] cache:haproxy: longer capture buffers for relevant headers [puppet] - 10https://gerrit.wikimedia.org/r/1100113 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:09:10] (03CR) 10BBlack: [C:03+1] Update geo-maps file's US section [dns] - 10https://gerrit.wikimedia.org/r/1097521 (owner: 10CDobbins) [15:10:09] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2445 to wikikube-worker2173 [15:10:20] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [15:10:23] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2446 to wikikube-worker2174 [15:10:28] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2447 to wikikube-worker2175 [15:13:04] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:14:39] (03CR) 10Effie Mouzeli: [C:03+2] kafka-main: Replace kafka-main1004 with kafka-main1009 [puppet] - 10https://gerrit.wikimedia.org/r/1100447 
(https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [15:16:06] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [15:17:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P71542 and previous config saved to /var/cache/conftool/dbconfig/20241204-151705-ladsgroup.json [15:17:21] I'm gonna do a quick scap sync-world to rebuild the 8.1 production images if there are no objections [15:18:00] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2445 to wikikube-worker2173 - jayme@cumin2002" [15:18:28] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2445 to wikikube-worker2173 - jayme@cumin2002" [15:18:28] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:29] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2173 [15:19:13] (03PS1) 10CDanis: tunnelencabulator: add upload-lb support [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100470 [15:20:22] (03PS2) 10CDanis: tunnelencabulator: add upload-lb support [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100470 [15:20:23] (03CR) 10Giuseppe Lavagetto: [C:03+1] tunnelencabulator: add upload-lb support [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100470 (owner: 10CDanis) [15:20:27] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2173 [15:20:48] !log jiji@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [15:20:52] (03PS1) 10Marostegui: Revert "db1167: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1100471 [15:21:04] (03CR) 10Giuseppe Lavagetto: [C:03+1] 
tunnelencabulator: add upload-lb support [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100470 (owner: 10CDanis) [15:21:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P71543 and previous config saved to /var/cache/conftool/dbconfig/20241204-152105-root.json [15:21:08] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2445 to wikikube-worker2173 [15:21:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1045.eqiad.wmnet with OS bookworm [15:21:25] (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: add upload-lb support [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100470 (owner: 10CDanis) [15:21:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10380106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm ex... 
[15:21:34] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2446 to wikikube-worker2174 - jayme@cumin2002" [15:21:39] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2446 to wikikube-worker2174 - jayme@cumin2002" [15:21:40] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:21:41] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2174 [15:21:48] (03CR) 10Marostegui: [C:03+2] Revert "db1167: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1100471 (owner: 10Marostegui) [15:22:02] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [15:22:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2174 [15:22:47] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2446 to wikikube-worker2174 [15:23:24] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10380111 (10JMeybohm) [15:24:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381478#10380114 (10JMeybohm) [15:24:26] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:27] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2175 [15:24:42] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2175 [15:24:45] !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to pick up new php8.1 base [15:25:22] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2447 to 
wikikube-worker2175 [15:26:19] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2173.codfw.wmnet with OS bookworm [15:26:30] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2173 [15:26:43] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2173.codfw.wmnet wikikube-worker2174.codfw.wmnet wikikube-worker2175.codfw.wmnet on all recursors [15:26:46] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2173.codfw.wmnet wikikube-worker2174.codfw.wmnet wikikube-worker2175.codfw.wmnet on all recursors [15:26:47] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [15:26:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1037.eqiad.wmnet with OS bookworm [15:27:37] !log homer 'cr*eqiad*' commit 'T377876' [15:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:40] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [15:27:40] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2174.codfw.wmnet with OS bookworm [15:27:45] (03CR) 10Giuseppe Lavagetto: [C:03+2] profile::mariadb::core: alert on all replicas [puppet] - 10https://gerrit.wikimedia.org/r/1100457 (https://phabricator.wikimedia.org/T381276) (owner: 10Giuseppe Lavagetto) [15:28:26] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2175.codfw.wmnet with OS bookworm [15:30:25] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2173 - jayme@cumin2002" [15:30:31] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2173 - jayme@cumin2002" [15:30:31] !log 
jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:31] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2173.codfw.wmnet 78.48.192.10.in-addr.arpa 8.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:30:35] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2173.codfw.wmnet 78.48.192.10.in-addr.arpa 8.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:30:37] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2173 [15:30:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [15:31:26] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2173 [15:31:27] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2173 [15:31:39] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2174 [15:31:47] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [15:32:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T371742)', diff saved to https://phabricator.wikimedia.org/P71544 and previous config saved to /var/cache/conftool/dbconfig/20241204-153212-ladsgroup.json [15:32:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:32:15] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:32:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:32:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T371742)', diff saved to 
https://phabricator.wikimedia.org/P71545 and previous config saved to /var/cache/conftool/dbconfig/20241204-153234-ladsgroup.json [15:36:01] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2174 - jayme@cumin2002" [15:36:07] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2174 - jayme@cumin2002" [15:36:07] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:08] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2174.codfw.wmnet 79.48.192.10.in-addr.arpa 9.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:36:11] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2174.codfw.wmnet 79.48.192.10.in-addr.arpa 9.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:36:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P71546 and previous config saved to /var/cache/conftool/dbconfig/20241204-153611-root.json [15:36:12] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2174 [15:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:36:39] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2174 [15:36:39] !log 
jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2174 [15:37:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:37:26] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2175 [15:39:18] (03PS5) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [15:39:29] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [15:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:41:33] (03CR) 10Volans: "post-merge -1, doesn't work" [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:41:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1036-1037].eqiad.wmnet [15:41:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1036-1037].eqiad.wmnet [15:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:43:18] (03Abandoned) 10Ahmon Dancy: bootstrap-scap-target.sh: Temp hard code scap version [puppet] - 10https://gerrit.wikimedia.org/r/1100204 (https://phabricator.wikimedia.org/T380772) (owner: 10Ahmon Dancy) [15:45:53] !log jayme@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2175 - jayme@cumin2002" [15:45:59] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2175 - jayme@cumin2002" [15:45:59] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:59] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2175.codfw.wmnet 80.48.192.10.in-addr.arpa 0.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:46:03] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2175.codfw.wmnet 80.48.192.10.in-addr.arpa 0.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:46:04] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2175 [15:47:46] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504 (10Jelto) 03NEW [15:49:20] (03CR) 10JHathaway: puppet 7: fix facter.conf location (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [15:50:12] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2440 - https://phabricator.wikimedia.org/T381469#10380226 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm T 381478 - renamed server to wikikube-worker2015 T 358489 - probably false alert from this ticket. 
[15:50:33] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2173.codfw.wmnet with reason: host reimage [15:50:53] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2444 - https://phabricator.wikimedia.org/T381472#10380230 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm T 381478 - renamed server to wikikube-worker2172 T 358489 - probably false alert from this ticket. [15:51:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2175 [15:51:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2175 [15:51:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P71548 and previous config saved to /var/cache/conftool/dbconfig/20241204-155116-root.json [15:51:26] (03CR) 10Ahmon Dancy: [C:03+1] "Looks correct. One optional suggestion." [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) (owner: 10Jaime Nuche) [15:52:03] (03PS2) 10Pcoombe: CSP for banner preview: allow remind me later SMS host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [15:54:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2173.codfw.wmnet with reason: host reimage [15:54:48] (03PS1) 10CDanis: tunnelencabulator: bump version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100475 [15:54:57] (03CR) 10Ssingh: [C:03+1] "Not sure why in your PCC run cp7001 wasn't checked since you clearly specify it but" [puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [15:54:57] (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: bump version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1100475 
(owner: 10CDanis) [15:55:33] (03PS1) 10Muehlenhoff: yarn: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100476 [15:55:56] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2174.codfw.wmnet with reason: host reimage [15:58:16] (03CR) 10Bking: [C:03+2] opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [15:58:24] (03PS4) 10Jaime Nuche: bootstrap-scap-target.sh: handle multiple wheel versions [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) [15:58:25] (03CR) 10Bking: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [15:58:32] 06SRE, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10380240 (10Marostegui) @xcollazo what the status of this? We keep seeing issues with dumps. We just got the enwiki dumps replica lagged again while dumps were r... [15:58:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100476 (owner: 10Muehlenhoff) [15:58:56] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [15:59:05] (03CR) 10Ahmon Dancy: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) (owner: 10Jaime Nuche) [15:59:32] (03CR) 10Gmodena: mw-dump-rev-content-reconcile-enrich: rename namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [16:01:33] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2174.codfw.wmnet with reason: host reimage [16:04:34] (03CR) 10Jaime Nuche: bootstrap-scap-target.sh: handle multiple wheel versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) (owner: 10Jaime Nuche) [16:06:16] (03CR) 10CDanis: [C:03+1] New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [16:06:16] !log hnowlan@deploy2002 Finished scap sync-world: Rebuild and deploy to pick up new php8.1 base (duration: 42m 17s) [16:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P71549 and previous config saved to /var/cache/conftool/dbconfig/20241204-160622-root.json [16:06:31] (03CR) 10Jaime Nuche: "@cgoubert@wikimedia.org We need this change to fix bootstrapping of new Scap hosts. Could you help with merging? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) (owner: 10Jaime Nuche) [16:06:39] (03CR) 10CDanis: [C:03+1] P:idp enable JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [16:07:22] (03CR) 10Bking: "Elasticsearch's docs warn that "there is no validation to block unsupported settings from the keystore and they can cause Elasticsearch to" [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [16:07:58] (03CR) 10Bking: [C:03+1] opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [16:08:06] (03CR) 10SBassett: [C:03+1] "This incorporates the new connect-src directive limitation within the policy, which was suggested by acooper." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [16:08:12] (03CR) 10CDanis: [C:03+2] chart-renderer: scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100202 (https://phabricator.wikimedia.org/T379687) (owner: 10CDanis) [16:08:44] jnuche: is the change tested and ok to merge? 
[16:09:14] claime: yeah, I tested it on our scap3-dev env [16:09:21] ok cool [16:09:25] (03Merged) 10jenkins-bot: chart-renderer: scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100202 (https://phabricator.wikimedia.org/T379687) (owner: 10CDanis) [16:09:25] (03CR) 10Clément Goubert: [C:03+2] bootstrap-scap-target.sh: handle multiple wheel versions [puppet] - 10https://gerrit.wikimedia.org/r/1100407 (https://phabricator.wikimedia.org/T380772) (owner: 10Jaime Nuche) [16:09:26] (03PS1) 10CDanis: bump chart-renderer chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100479 [16:09:38] (03CR) 10CDanis: [C:03+2] bump chart-renderer chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100479 (owner: 10CDanis) [16:09:57] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2175.codfw.wmnet with reason: host reimage [16:09:58] claime: thx! [16:10:41] jnuche: merged, do you need me to run puppet on a specific server? [16:10:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:48] (03Merged) 10jenkins-bot: bump chart-renderer chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100479 (owner: 10CDanis) [16:11:00] Hi all, I'm planning to run a maintenance script to add wikidata support for idwikivoyage as per T381083. Is that disruptive to your current activities/ should I do it another time? [16:11:00] T381083: Add Wikidata support for idwikivoyage - https://phabricator.wikimedia.org/T381083 [16:11:53] claime: would be good to verify on `wdqs1027.eqiad.wmnet` [16:12:08] (03CR) 10Brouberol: mw-dump-rev-content-reconcile-enrich: rename namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [16:12:26] claime jnuche what's up with wdqs1027? 
LMK if I can help [16:12:36] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [16:12:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1025.eqiad.wmnet with OS bullseye [16:12:39] jnuche: puppet running [16:12:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10380301 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye completed: - wdqs1025... [16:13:15] inflatador: ryankemper ran into an issue with the Scap bootstrapping there last night [16:13:20] jnuche: puppet run done, what's to be done afterwards? [16:13:27] (03PS2) 10Brouberol: mw-dump-rev-content-reconcile-enrich: rename namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) [16:13:29] he worked around it, but now we've merged an actual fix [16:13:33] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2175.codfw.wmnet with reason: host reimage [16:13:36] (03CR) 10Brouberol: mw-dump-rev-content-reconcile-enrich: rename namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [16:13:52] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2173.codfw.wmnet with OS bookworm [16:14:04] claime: if it didn't fail, that should be it :) you can run `scap version` to double-check Scap is healthy on the box [16:14:17] jnuche ACK, I just reimaged wdqs1025 so we'll check it there later today. 
Probably be a few hrs though [16:14:17] 4.132.0 [16:14:18] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:14:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381478#10380295 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:14:27] claime: awesome, ty again! [16:14:30] It didn't trigger anything, just changed the script, btw [16:14:47] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:14:49] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:14:58] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:15:00] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:15:36] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:15:38] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:15:52] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:15:53] !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:16:33] !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:16:34] claime: hmm, that actually makes sense, is there an easy way to persuade Puppet to rerun the resource associated to the script? [16:16:34] !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:16:51] !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:16:52] !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:17:04] jnuche: let me check the code [16:17:08] !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:17:10] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:17:47] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:17:49] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [16:18:06] (03CR) 10Bking: "Does it make sense to export any of these stats as Prometheus metrics?" [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [16:18:14] (03CR) 10Fabfur: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [16:18:27] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:19:15] (03CR) 10Brouberol: [C:03+1] cumin: add aliases for net-new wdqs services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [16:19:41] !log jiji@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main1009.eqiad.wmnet [16:19:42] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main1009.eqiad.wmnet [16:20:47] (03CR) 10Effie Mouzeli: [C:03+2] Update various kafka-main connection strings for kafka-main1009 Replacing kafka-main1004 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100452 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:21:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P71550 and previous config saved to /var/cache/conftool/dbconfig/20241204-162127-root.json [16:21:35] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2174.codfw.wmnet with OS bookworm [16:21:39] (03PS1) 10Hnowlan: jobqueue: temporarily toggle video job to test mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100481 (https://phabricator.wikimedia.org/T371701) [16:21:48] (03PS1) 10Clément Goubert: scap: Trigger bootstrap-scap-target.sh on script change [puppet] - 10https://gerrit.wikimedia.org/r/1100482 
(https://phabricator.wikimedia.org/T380772) [16:21:58] jnuche: ^ this should do the trick [16:22:07] (03CR) 10Andy Cooper: [C:03+1] CSP for banner preview: allow remind me later SMS host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [16:22:22] (03Merged) 10jenkins-bot: Update various kafka-main connection strings for kafka-main1009 Replacing kafka-main1004 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100452 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:22:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Schema change [16:22:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Schema change [16:23:05] this will trigger the exec every time the subscribed resource (the script) changes [16:24:21] jnuche: the subscribe on the file resource for the symlink is maybe unnecessary [16:25:04] claime: hmm, not sure if we want to add that behavior, merging a faulty bootstrap script in the future could wipe out Scap from all targets [16:25:16] instead of just reimaged machines [16:25:22] (03PS4) 10JHathaway: puppet 7: fix facter.conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) [16:25:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:26:20] jnuche: then there's not really a way to do it except running the script manually [16:26:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:26:53] claime: doing that on a different box as we speak [16:26:58] ack [16:27:12] !log andrew@cumin1002 START - Cookbook
sre.hosts.decommission for hosts cloudcephmon1001.eqiad.wmnet [16:27:27] (03Abandoned) 10Clément Goubert: scap: Trigger bootstrap-scap-target.sh on script change [puppet] - 10https://gerrit.wikimedia.org/r/1100482 (https://phabricator.wikimedia.org/T380772) (owner: 10Clément Goubert) [16:27:38] (03PS1) 10Kamila Součková: Rename mw149[1-6] to wikikube-worker10[38-42] [puppet] - 10https://gerrit.wikimedia.org/r/1100483 (https://phabricator.wikimedia.org/T377876) [16:28:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [16:29:01] claime: running `sudo -su scap /usr/local/bin/bootstrap-scap-target.sh deployment /var/lib/scap` after the script updated on the machine worked :) [16:29:13] cool :) [16:29:13] jnuche@releases2003:~$ scap version [16:29:13] 4.132.0 [16:29:24] claime: I think we're good, thanks a lot again [16:29:27] np [16:32:44] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2175.codfw.wmnet with OS bookworm [16:32:54] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:33:17] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:33:21] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [16:33:21] (03CR) 10JHathaway: [C:03+2] puppet 7: fix facter.conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:33:38] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:33:39] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:33:48] !log jiji@deploy2002 
helmfile [staging] START helmfile.d/services/eventgate-main: apply [16:34:06] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [16:34:07] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [16:34:43] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [16:34:48] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [16:34:55] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:35:19] (03CR) 10JHathaway: [C:03+2] puppet 7: fix facter.conf location (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100161 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:35:21] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2173-2175].codfw.wmnet [16:35:29] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2173-2175].codfw.wmnet [16:35:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 304, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:14] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:36:33] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:36:35] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:36:38] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:36:49] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:36:53] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:37:10] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [16:37:11] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:37:12] !log jiji@deploy2002 
helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:37:44] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:38:34] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephmon1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [16:38:53] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephmon1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [16:38:53] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:54] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon1001.eqiad.wmnet [16:39:04] (03PS1) 10Hnowlan: mediawiki: correct mercurius command-args config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100486 (https://phabricator.wikimedia.org/T371701) [16:40:22] (03CR) 10Scott French: [C:03+1] mediawiki: correct mercurius command-args config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100486 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:41:58] (03CR) 10Hnowlan: [C:03+2] mediawiki: correct mercurius command-args config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100486 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:43:02] (03CR) 10Scott French: [C:03+1] jobqueue: temporarily toggle video job to test mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100481 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:43:16] (03CR) 10Hnowlan: [C:03+2] jobqueue: temporarily toggle video job to test mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100481 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:43:25] (03CR) 10Hnowlan: mediawiki: 
correct mercurius command-args config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100486 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:44:19] (03Merged) 10jenkins-bot: jobqueue: temporarily toggle video job to test mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100481 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:45:29] (03CR) 10Hnowlan: [C:03+2] mediawiki: correct mercurius command-args config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100486 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:45:34] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [16:46:24] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [16:46:38] (03PS12) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [16:47:25] (03PS1) 10Cathal Mooney: Increase the number of gnmic worker and writer threads [puppet] - 10https://gerrit.wikimedia.org/r/1100488 (https://phabricator.wikimedia.org/T369384) [16:47:25] (03Merged) 10jenkins-bot: mediawiki: correct mercurius command-args config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100486 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:48:31] (03CR) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [16:49:58] (03PS2) 10Cathal Mooney: Increase the number of gnmic worker and writer threads [puppet] - 10https://gerrit.wikimedia.org/r/1100488 (https://phabricator.wikimedia.org/T369384) [16:51:06] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcephmon1002.eqiad.wmnet [16:55:10] (03PS1) 10JHathaway: facter: fix facter 
conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100490 (https://phabricator.wikimedia.org/T330490) [16:55:32] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100490 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:56:00] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [16:57:39] (03CR) 10Muehlenhoff: [C:03+1] "Doh, ofc!" [puppet] - 10https://gerrit.wikimedia.org/r/1100490 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:59:17] (03PS2) 10JHathaway: facter: fix facter conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100490 (https://phabricator.wikimedia.org/T330490) [16:59:23] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100490 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [16:59:36] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephmon1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [16:59:42] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100491 [16:59:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephmon1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [16:59:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:53] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon1002.eqiad.wmnet [17:00:10] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcephmon1003.eqiad.wmnet [17:03:02] (03CR) 10JHathaway: [C:03+2] facter: fix facter conf location [puppet] - 10https://gerrit.wikimedia.org/r/1100490 (https://phabricator.wikimedia.org/T330490) 
(owner: 10JHathaway) [17:04:38] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [17:08:50] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephmon1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [17:10:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephmon1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [17:10:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:02] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon1003.eqiad.wmnet [17:10:22] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudcephmon100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1098096 (https://phabricator.wikimedia.org/T380893) (owner: 10Andrew Bogott) [17:13:21] 10ops-eqiad, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10380484 (10Andrew) [17:14:39] !incidents [17:14:39] 5508 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [17:14:40] 5507 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [17:14:40] 5506 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [17:14:40] 5505 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [17:15:15] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 220, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:15:17] (03PS1) 10Btullis: [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 
10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) [17:15:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T371742)', diff saved to https://phabricator.wikimedia.org/P71551 and previous config saved to /var/cache/conftool/dbconfig/20241204-171530-ladsgroup.json [17:15:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:15:55] (03CR) 10CI reject: [V:04-1] [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: 10Btullis) [17:17:22] (03PS2) 10Btullis: [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) [17:18:29] (03PS3) 10Btullis: [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) [17:19:15] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4637/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: 10Btullis) [17:20:29] (03CR) 10CI reject: [V:04-1] [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: 10Btullis) [17:22:24] (03PS5) 10Alexandros Kosiaris: gateway-check: avoid mutation of gateway_config [puppet] - 10https://gerrit.wikimedia.org/r/1100474 (owner: 10Scott French) [17:22:24] (03CR) 10Alexandros Kosiaris: [C:03+1] "Good catch. This didn't have a very large time window of biting, but it could end up being confusing to some cases. Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100474 (owner: 10Scott French) [17:22:27] (03PS4) 10Btullis: [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) [17:22:45] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10380540 (10Jhancock.wm) heads up UPS didn't deliver yesterday. still waiting. [17:22:51] (03CR) 10Btullis: [dumps] Increase the lbzip2 thread count for large wikis [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: 10Btullis) [17:23:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4638/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: 10Btullis) [17:23:50] (03CR) 10Andrew Bogott: "Here is the full output from one node:" [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [17:25:28] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100474 (owner: 10Scott French) [17:25:31] (03CR) 10Scott French: [C:03+2] gateway-check: avoid mutation of gateway_config [puppet] - 10https://gerrit.wikimedia.org/r/1100474 (owner: 10Scott French) [17:25:59] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:13] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:30:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:30:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P71553 and previous config saved to /var/cache/conftool/dbconfig/20241204-173037-ladsgroup.json [17:31:16] (03PS1) 10Clare Ming: Metrics Platform Instrument/Experiment Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100502 (https://phabricator.wikimedia.org/T379247) [17:34:24] (03CR) 10Scott French: "Thanks for the review, Alexandros!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [17:34:25] (03PS1) 10Clare Ming: Metrics Platform Instrument/Experiment Configurator: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100504 (https://phabricator.wikimedia.org/T379247) [17:37:06] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument/Experiment Configurator: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100504 (https://phabricator.wikimedia.org/T379247) (owner: 10Clare Ming) [17:37:10] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument/Experiment Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100502 (https://phabricator.wikimedia.org/T379247) (owner: 10Clare Ming) [17:38:08] (03Merged) 10jenkins-bot: Metrics Platform Instrument/Experiment Configurator: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100504 (https://phabricator.wikimedia.org/T379247) (owner: 10Clare Ming) [17:38:16] (03Merged) 10jenkins-bot: Metrics Platform Instrument/Experiment Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100502 (https://phabricator.wikimedia.org/T379247) (owner: 10Clare Ming) [17:42:02] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100488 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [17:45:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P71554 and previous config saved to /var/cache/conftool/dbconfig/20241204-174544-ladsgroup.json [17:47:25] (03PS5) 10Scott French: mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) [17:47:25] (03PS5) 10Scott French: mw-api-int: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) [17:47:25] (03PS5) 10Scott French: mw-api-int: remove "migration" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) [17:47:25] (03PS5) 10Scott French: mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040) [17:47:26] (03PS5) 10Scott French: mediawiki: remove migration release overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040) [17:49:09] (03CR) 10Scott French: "Thanks for the re-review! Just rebased and re-bumped the chart version. I'll merge this during the upcoming infra window." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:50:04] !log joal@deploy2002 Started deploy [analytics/refinery@6e3ee14]: Regular analytics weekly train [analytics/refinery@6e3ee14b] [17:50:30] !log Moved SAL fediverse posts to https://wikimedia.social/@sal. Many thanks to botsin.space for providing hosting for so long. 
[17:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:10] !log joal@deploy2002 Finished deploy [analytics/refinery@6e3ee14]: Regular analytics weekly train [analytics/refinery@6e3ee14b] (duration: 02m 05s) [17:54:07] !log joal@deploy2002 Started deploy [analytics/refinery@6e3ee14] (thin): Regular analytics weekly train THIN [analytics/refinery@6e3ee14b] [17:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:44] !log joal@deploy2002 Finished deploy [analytics/refinery@6e3ee14] (thin): Regular analytics weekly train THIN [analytics/refinery@6e3ee14b] (duration: 00m 37s) [17:54:49] !log joal@deploy2002 Started deploy [analytics/refinery@6e3ee14] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6e3ee14b] [17:55:20] !log joal@deploy2002 Finished deploy [analytics/refinery@6e3ee14] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6e3ee14b] (duration: 00m 31s) [17:56:15] (03CR) 10Cathal Mooney: [C:03+2] Increase the number of gnmic worker and writer threads [puppet] - 10https://gerrit.wikimedia.org/r/1100488 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [18:00:04] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1800). 
[18:00:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T371742)', diff saved to https://phabricator.wikimedia.org/P71555 and previous config saved to /var/cache/conftool/dbconfig/20241204-180052-ladsgroup.json [18:00:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:00:56] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:01:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:01:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T371742)', diff saved to https://phabricator.wikimedia.org/P71556 and previous config saved to /var/cache/conftool/dbconfig/20241204-180114-ladsgroup.json [18:01:22] here a bit earlier than expected, and will start work shortly [18:01:57] (03CR) 10Scott French: [C:03+2] mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:04:01] (03Merged) 10jenkins-bot: mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:04:16] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:04:37] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [18:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:09:30] (03CR) 10Máté Szabó: [C:03+1] Update MediaModeration module to run 
scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [18:11:24] !log swfrench@deploy2002 Started scap sync-world: Deployment to clear noop chart diff from 1081449 - T377040 [18:11:30] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:13:31] !log swfrench@deploy2002 Finished scap sync-world: Deployment to clear noop chart diff from 1081449 - T377040 (duration: 02m 07s) [18:13:55] (03PS1) 10Dbrant: push-notifications: Add proxy env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100513 (https://phabricator.wikimedia.org/T379647) [18:15:13] all done on my end [18:15:41] (03PS1) 10CDanis: app/generic copypatch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100514 [18:15:41] (03PS1) 10CDanis: app/generic: add support for a metricsPort [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100515 [18:15:41] (03PS1) 10CDanis: chart-renderer: use the metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100516 (https://phabricator.wikimedia.org/T379687) [18:22:03] (03PS2) 10Dbrant: push-notifications: Add proxy env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100513 (https://phabricator.wikimedia.org/T379647) [18:24:54] (03PS3) 10CDanis: push-notifications: New release & proxy env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100513 (https://phabricator.wikimedia.org/T379647) (owner: 10Dbrant) [18:24:57] (03CR) 10CDanis: [C:03+2] push-notifications: New release & proxy env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100513 (https://phabricator.wikimedia.org/T379647) (owner: 10Dbrant) [18:25:59] (03Merged) 10jenkins-bot: push-notifications: New release & proxy env vars. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1100513 (https://phabricator.wikimedia.org/T379647) (owner: 10Dbrant) [18:30:22] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [18:30:56] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [18:31:19] (03CR) 10Scott French: "Thanks, Alexandros!" [puppet] - 10https://gerrit.wikimedia.org/r/1100112 (https://phabricator.wikimedia.org/T374683) (owner: 10Alexandros Kosiaris) [18:33:55] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [18:34:41] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [18:35:04] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [18:35:43] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [18:39:08] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: pybal pools for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097541 (https://phabricator.wikimedia.org/T379330) (owner: 10Ryan Kemper) [18:46:19] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-internal-scholarly,service=wdqs-scholarly [18:46:31] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-internal-main,service=wdqs-main [18:47:52] !log T379330 `wdqs-internal-main` and `wdqs-internal-scholarly` pools created [18:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:55] T379330: Create pybal pools for wdqs-internal-main and wdqs-internal-scholarly - https://phabricator.wikimedia.org/T379330 [18:47:58] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:48:11] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:50:00] jouncebot: nowandnext [18:50:00] For the next 0 hour(s) 
and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1800) [18:50:01] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1900) [18:50:34] (03PS5) 10Ryan Kemper: wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) [18:50:34] (03PS5) 10Ryan Kemper: wdqs-internal: add graph split disc DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [18:52:12] !log T379334 Creating A and PTR records for `wdqs-internal-main` and `wdqs-internal-scholarly` VIPs [merging https://gerrit.wikimedia.org/r/c/operations/dns/+/1100010/ & running authdns update after] [18:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:15] T379334: Create DNS records for wdqs-internal-main and wdqs-internal-scholarly - https://phabricator.wikimedia.org/T379334 [18:52:19] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: add A & PTR records for graph split [dns] - 10https://gerrit.wikimedia.org/r/1100010 (https://phabricator.wikimedia.org/T379334) (owner: 10Ryan Kemper) [18:54:24] (03PS7) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [18:55:03] !log T379334 Successfully ran `sudo authdns-update` on `dns1004` [18:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:52] <_Gerges> Hi, Does the T381445 task need community consensus? 
[18:56:52] T381445: Add "Noto Sans Arabic" Font - https://phabricator.wikimedia.org/T381445 [18:57:52] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache wdqs-internal-main.svc.eqiad.wmnet on all recursors [18:57:56] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs-internal-main.svc.eqiad.wmnet on all recursors [18:58:02] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:58:05] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:58:24] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:58:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [19:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T1900) [19:02:05] !log T380555 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094061 to establish initial service definitions for `wdqs-internal-main` and `wdqs-internal-scholarly` [19:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:08] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [19:02:11] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: Add graph split svcs to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:02:19] (03CR) 10Scott French: [C:03+1] "LGTM, with maybe two stale comments." 
[puppet] - 10https://gerrit.wikimedia.org/r/1100483 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [19:02:32] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [19:03:02] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [19:04:50] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1094069/4639/wdqs2018.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:05:13] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097542 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [19:09:18] !log T379333 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097542 to establish envoy on `A:wdqs-internal-main` and `A:wdqs-internal-scholarly`; running puppet on `wdqs2018` to test change [19:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:21] T379333: Create envoy config for wdqs-internal-main and wdqs-internal-scholarly - https://phabricator.wikimedia.org/T379333 [19:13:12] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10381052 (10VRiley-WMF) Understood, I will close this and ask for a replacement! 
[19:13:24] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10381053 (10VRiley-WMF) 05Open→03Resolved [19:13:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:14:34] oh ok [19:14:34] (03PS2) 10Kamila Součková: Rename mw149[1-6] to wikikube-worker10[38-42] [puppet] - 10https://gerrit.wikimedia.org/r/1100483 (https://phabricator.wikimedia.org/T377876) [19:14:35] let's see [19:15:53] (03CR) 10Kamila Součková: Rename mw149[1-6] to wikikube-worker10[38-42] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1100483 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [19:16:31] !log T380555 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094069 to enable `lvs::realserver` [19:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:34] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [19:16:41] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:17:20] (03CR) 10Scott French: [C:03+1] Rename mw149[1-6] to wikikube-worker10[38-42] [puppet] - 10https://gerrit.wikimedia.org/r/1100483 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [19:19:15] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1491-1496].eqiad.wmnet [19:19:54] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:20:26] !log joal@deploy2002 Started deploy [analytics/refinery@1f94312]: Regular analytics weekly train - HOTFIX [analytics/refinery@1f94312a] [19:21:12] (03CR) 10Kamila Součková: [C:03+2] Rename mw149[1-6] to wikikube-worker10[38-42] [puppet] - 10https://gerrit.wikimedia.org/r/1100483 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [19:22:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1491-1496].eqiad.wmnet [19:23:43] !log joal@deploy2002 Finished deploy [analytics/refinery@1f94312]: Regular analytics weekly train - HOTFIX [analytics/refinery@1f94312a] (duration: 03m 17s) [19:23:57] !log joal@deploy2002 Started deploy [analytics/refinery@1f94312] (thin): Regular analytics weekly train THIN - HOTFIX [analytics/refinery@1f94312a] [19:24:28] !log joal@deploy2002 Finished deploy [analytics/refinery@1f94312] (thin): Regular analytics weekly train THIN - HOTFIX [analytics/refinery@1f94312a] (duration: 00m 30s) [19:24:39] !log joal@deploy2002 Started deploy [analytics/refinery@1f94312] (hadoop-test): Regular analytics weekly train TEST - HOTFIX [analytics/refinery@1f94312a] [19:25:03] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1491 to wikikube-worker1038 [19:25:06] !log joal@deploy2002 Finished deploy [analytics/refinery@1f94312] (hadoop-test): Regular analytics weekly train TEST - HOTFIX [analytics/refinery@1f94312a] (duration: 00m 26s) [19:25:23] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:26:21] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1492 to wikikube-worker1039 [19:26:54] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:27:16] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP 
CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:27:18] (03CR) 10RLazarus: [C:03+1] app/generic copypatch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100514 (owner: 10CDanis) [19:27:22] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:27:23] (03CR) 10RLazarus: [C:03+1] app/generic: add support for a metricsPort [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100515 (owner: 10CDanis) [19:27:47] (03CR) 10CDanis: [C:03+2] app/generic copypatch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100514 (owner: 10CDanis) [19:27:54] (03CR) 10CDanis: [C:03+2] app/generic: add support for a metricsPort [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100515 (owner: 10CDanis) [19:28:47] (03Merged) 10jenkins-bot: app/generic copypatch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100514 (owner: 10CDanis) [19:29:06] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1491 to wikikube-worker1038 - kamila@cumin1002" [19:29:14] (03Merged) 10jenkins-bot: app/generic: add support for a metricsPort [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100515 (owner: 10CDanis) [19:29:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1491 to wikikube-worker1038 - kamila@cumin1002" [19:29:36] !log 
kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:29:37] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1038 [19:29:42] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:29:43] (03CR) 10CDanis: [C:03+2] chart-renderer: use the metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100516 (https://phabricator.wikimedia.org/T379687) (owner: 10CDanis) [19:29:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1038 [19:30:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1491 to wikikube-worker1038 [19:30:33] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1493 to wikikube-worker1040 [19:31:12] (03Merged) 10jenkins-bot: chart-renderer: use the metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100516 (https://phabricator.wikimedia.org/T379687) (owner: 10CDanis) [19:33:16] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1492 to wikikube-worker1039 - kamila@cumin1002" [19:34:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1492 to wikikube-worker1039 - kamila@cumin1002" [19:34:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:34:04] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1039 [19:34:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1039 [19:34:39] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:34:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw1494:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:34:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1492 to wikikube-worker1039 [19:35:10] !log joal@deploy2002 Started deploy [airflow-dags/analytics@df2cac9]: Regular analytics weekly train [airflow-dags/analytics@df2cac98] [19:35:48] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [19:36:23] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [19:37:13] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1494 to wikikube-worker1041 [19:38:07] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1493 to wikikube-worker1040 - kamila@cumin1002" [19:38:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1493 to wikikube-worker1040 - kamila@cumin1002" [19:38:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:38:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:38:46] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1040 [19:38:49] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:38:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1040 [19:39:06] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@df2cac9]: Regular analytics weekly train [airflow-dags/analytics@df2cac98] (duration: 
03m 55s) [19:39:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1493 to wikikube-worker1040 [19:39:44] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:39:46] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:40:11] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [19:40:14] (03CR) 10Ryan Kemper: [C:03+1] wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:40:27] (03CR) 10Ssingh: [C:03+1] "Don't merge this yet." [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:40:29] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1495 to wikikube-worker1042 [19:40:55] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [19:42:52] (03PS1) 10Ryan Kemper: wdqs-internal: fix graph split conftool svc [puppet] - 10https://gerrit.wikimedia.org/r/1100524 (https://phabricator.wikimedia.org/T380555) [19:42:53] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1494 to wikikube-worker1041 - kamila@cumin1002" [19:43:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1494 to wikikube-worker1041 - kamila@cumin1002" [19:43:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:20] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1041 
[19:43:20] (03CR) 10Ssingh: [C:03+1] "Nice find! Should work." [puppet] - 10https://gerrit.wikimedia.org/r/1100524 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:43:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1041 [19:43:33] (03CR) 10Bking: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1100524 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:43:33] (03PS2) 10Ryan Kemper: wdqs-internal: fix graph split conftool svc [puppet] - 10https://gerrit.wikimedia.org/r/1100524 (https://phabricator.wikimedia.org/T380555) [19:43:50] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:44:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1494 to wikikube-worker1041 [19:44:18] whose toes am I stepping on with netbox changes, and do you mind? [19:45:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T371742)', diff saved to https://phabricator.wikimedia.org/P71558 and previous config saved to /var/cache/conftool/dbconfig/20241204-194459-ladsgroup.json [19:45:03] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:45:16] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1496 to wikikube-worker1043 [19:45:48] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:46:27] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: fix graph split conftool svc [puppet] - 10https://gerrit.wikimedia.org/r/1100524 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:47:44] (03PS7) 10Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 
(https://phabricator.wikimedia.org/T380555) [19:47:44] (03PS6) 10Ryan Kemper: wdqs-internal: configure graphsplit load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555) [19:47:44] (03PS6) 10Ryan Kemper: wdqs-internal: bring graph split into production [puppet] - 10https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) [19:49:02] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1495 to wikikube-worker1042 - kamila@cumin1002" [19:49:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1495 to wikikube-worker1042 - kamila@cumin1002" [19:49:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:49:22] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1042 [19:49:23] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [19:49:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1042 [19:50:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1495 to wikikube-worker1042 [19:51:48] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: configure lvs IPs for backends [puppet] - 10https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [19:52:43] !log sudo cumin "O:config_master" "run-puppet-agent" [19:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:58] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1496 to wikikube-worker1043 - kamila@cumin1002" [19:53:03] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1496 to wikikube-worker1043 - kamila@cumin1002" [19:53:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:53:03] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1043 [19:53:07] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [19:53:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1043 [19:53:36] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [19:53:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-internal-main.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:53:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1496 to wikikube-worker1043 [19:55:01] !log T380555 Proceeding to step 5 of new lvs service process. 
Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094069 to enable lvs::realserver functionality [19:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:04] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [19:55:41] !log T380555 Running puppet on `wdqs2018` [19:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:52] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1038.eqiad.wmnet on all recursors [19:55:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1038.eqiad.wmnet on all recursors [19:57:01] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1038.eqiad.wmnet wikikube-worker1039.eqiad.wmnet wikikube-worker1040.eqiad.wmnet wikikube-worker1041.eqiad.wmnet wikikube-worker1042.eqiad.wmnet wikikube-worker1043.eqiad.wmnet on all recursors [19:57:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1038.eqiad.wmnet wikikube-worker1039.eqiad.wmnet wikikube-worker1040.eqiad.wmnet wikikube-worker1041.eqiad.wmnet wikikube-worker1042.eqiad.wmnet wikikube-worker1043.eqiad.wmnet on all recursors [19:58:39] (03PS1) 10Clare Ming: Metrics Platform Instrument/Experiment Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100527 [19:58:50] (03CR) 10BryanDavis: [C:03+1] "Cherry-pick updated on deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud and puppet run forced on deployment-mediawiki81.de" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [20:00:04] (03PS1) 10Clare Ming: Metrics Platform Instrument/Experiment Configurator: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100528 [20:00:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1178', diff saved to https://phabricator.wikimedia.org/P71559 and previous config saved to /var/cache/conftool/dbconfig/20241204-200006-ladsgroup.json [20:01:03] (03PS1) 10CDanis: app/generic: metricsPort: add to NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100529 [20:01:11] (03CR) 10CI reject: [V:04-1] app/generic: metricsPort: add to NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100529 (owner: 10CDanis) [20:03:33] (03PS2) 10CDanis: app/generic: metricsPort: add to NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100529 [20:04:51] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "T377876 - kamila@cumin1002" [20:04:54] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [20:04:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "T377876 - kamila@cumin1002" [20:05:04] (03PS1) 10Bartosz Dziewoński: MediaWiki: Ensure nice 404 instead of php-fpm 404 on auth domain [puppet] - 10https://gerrit.wikimedia.org/r/1100530 (https://phabricator.wikimedia.org/T380551) [20:05:06] (03PS1) 10Bartosz Dziewoński: MediaWiki: Define wikimedia.org portal on beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) [20:05:08] (03PS1) 10Bartosz Dziewoński: MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) [20:05:10] (03PS1) 10Bartosz Dziewoński: MediaWiki: Remove duplicate ErrorDocument 404 from beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100533 [20:05:10] (03PS1) 10Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T380551) [20:05:32] (03CR) 10CDanis: [C:03+2] 
app/generic: metricsPort: add to NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100529 (owner: 10CDanis) [20:07:01] (03Merged) 10jenkins-bot: app/generic: metricsPort: add to NetworkPolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100529 (owner: 10CDanis) [20:07:30] !log T380555 Disabling puppet on lvs hosts in preparation for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094070 which will move `wdqs-internal-[main,scholarly]` from `service_setup` to `lvs_setup` [20:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:34] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [20:08:25] !log T380555 ran `ryankemper@cumin2002:~$ sudo -E cumin 'lvs*' 'disable-puppet T380555'` [20:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:59] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1038.eqiad.wmnet with OS bookworm [20:09:33] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:09:51] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [20:10:06] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1039.eqiad.wmnet with OS bookworm [20:10:48] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: configure graphsplit load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [20:12:38] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [20:12:44] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [20:12:51] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [20:12:55] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [20:15:14] !log ladsgroup@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P71560 and previous config saved to /var/cache/conftool/dbconfig/20241204-201513-ladsgroup.json [20:16:10] (03PS1) 10Dbrant: push-notifications: Add no_proxy: localhost, for making API calls. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100535 (https://phabricator.wikimedia.org/T379647) [20:16:51] (03PS1) 10Andrew Bogott: codfw1dev cinder backups: change lifespan to 2 days [puppet] - 10https://gerrit.wikimedia.org/r/1100536 [20:17:05] !log T380555 Beginning lvs rolling restarts. first up `A:lvs-secondary-codfw` [20:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:08] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [20:17:33] !log T380555 `sudo -E cumin 'A:lvs-secondary-codfw' 'run-puppet-agent --force'` [20:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:41] (03CR) 10CDanis: [C:03+2] push-notifications: Add no_proxy: localhost, for making API calls. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100535 (https://phabricator.wikimedia.org/T379647) (owner: 10Dbrant) [20:17:57] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev cinder backups: change lifespan to 2 days [puppet] - 10https://gerrit.wikimedia.org/r/1100536 (owner: 10Andrew Bogott) [20:18:42] !log T380555 `sudo cookbook sre.loadbalancer.restart-pybal 'A:lvs-secondary-codfw' --reason 'rolling out new wdqs-internal-[main,scholarly] services'` [20:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:50] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: Sort newcomers by claim date [puppet] - 10https://gerrit.wikimedia.org/r/1092205 (owner: 10Aklapper) [20:19:04] (03Merged) 10jenkins-bot: push-notifications: Add no_proxy: localhost, for making API calls. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1100535 (https://phabricator.wikimedia.org/T379647) (owner: 10Dbrant) [20:20:46] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [20:20:50] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [20:20:58] (03CR) 10Dzahn: [C:03+2] "tested query but did not send a test mail, want one?" [puppet] - 10https://gerrit.wikimedia.org/r/1092205 (owner: 10Aklapper) [20:20:59] !log ryankemper@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs [20:21:28] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [20:21:29] !log T380555 `sudo cookbook sre.loadbalancer.restart-pybal --query 'A:lvs-secondary-codfw' --reason 'rolling out new wdqs-internal-[main,scholarly] services' restart_daemons` [20:21:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs [20:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:20] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [20:22:25] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 117 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [20:22:30] that's OK [20:22:39] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [20:23:07] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [20:23:15] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [20:23:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:23:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:23:56] (03PS2) 10Bartosz Dziewoński: MediaWiki: Ensure nice 404 instead of php-fpm 404 on auth domain [puppet] - 10https://gerrit.wikimedia.org/r/1100530 (https://phabricator.wikimedia.org/T380551) [20:24:42] !log T380555 hosts happily pooled and `sudo ipvsadm -L -n` shows `10.2.1.93` and `10.2.1.94` as expected), proceeding to `A:lvs-low-traffic-codfw` [20:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:45] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [20:25:12] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1038.eqiad.wmnet with reason: host reimage [20:25:43] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.loadbalancer.restart-pybal (exit_code=97) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [20:25:48] pybal looks unhappy on lvs1020 [20:26:30] ok [20:26:32] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1039.eqiad.wmnet with reason: host reimage [20:26:33] restarted [20:28:11] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet [20:28:11] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet [20:28:16] !log T380555 ran `sudo -E cumin 'A:lvs-low-traffic-codfw' 'run-puppet-agent --force'` [20:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:26] !log ryankemper@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [20:28:28] !log T380555 `sudo cookbook sre.loadbalancer.restart-pybal --query 'A:lvs-low-traffic-codfw' --reason 'rolling out new 
wdqs-internal-[main,scholarly] services' restart_daemons` [20:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:37] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [20:28:41] ~cool [20:28:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1038.eqiad.wmnet with reason: host reimage [20:28:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10381305 (10cmooney) I'd hope we could avoid a lot of manual work and get this server set up using the new automation we are trying to build for Fundraising servers (see T37955... [20:28:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [20:30:14] (03CR) 10Bartosz Dziewoński: "Cherry-picked on the beta cluster following these instructions: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code" [puppet] - 10https://gerrit.wikimedia.org/r/1100530 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [20:30:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T371742)', diff saved to https://phabricator.wikimedia.org/P71561 and previous config saved to /var/cache/conftool/dbconfig/20241204-203021-ladsgroup.json [20:30:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [20:30:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:30:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [20:30:43] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T371742)', diff saved to https://phabricator.wikimedia.org/P71562 and previous config saved to /var/cache/conftool/dbconfig/20241204-203043-ladsgroup.json [20:31:49] (03PS2) 10Bartosz Dziewoński: MediaWiki: Define wikimedia.org portal on beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) [20:32:04] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs [20:32:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1039.eqiad.wmnet with reason: host reimage [20:33:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs [20:36:00] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [20:37:52] !log T380555 hosts happily pooled (except that `lvs2013` aka `A:lvs-low-traffic-codfw` cannot talk to `wdqs2026`) and `sudo ipvsadm -L -n` shows `10.2.1.93` and `10.2.1.94` as expected, codfw all done [20:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:55] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [20:39:04] (03PS3) 10Bartosz Dziewoński: MediaWiki: Define wikimedia.org portal on beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) [20:39:06] (03CR) 10Bartosz Dziewoński: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) (owner: 10Bartosz Dziewoński) [20:41:36] (03CR) 10Bartosz Dziewoński: "Cherry-picked on the beta cluster following these instructions: 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code" [puppet] - 10https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) (owner: 10Bartosz Dziewoński) [20:41:56] (03PS2) 10Bartosz Dziewoński: MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) [20:44:15] (03CR) 10Bartosz Dziewoński: "Cherry-picked on the beta cluster following these instructions: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code" [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [20:44:24] (03PS2) 10Bartosz Dziewoński: MediaWiki: Remove duplicate ErrorDocument 404 from beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100533 [20:45:52] (03CR) 10Bartosz Dziewoński: "Cherry-picked on the beta cluster following these instructions: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code" [puppet] - 10https://gerrit.wikimedia.org/r/1100533 (owner: 10Bartosz Dziewoński) [20:46:01] (03PS2) 10Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T380551) [20:46:27] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument/Experiment Configurator: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100528 (owner: 10Clare Ming) [20:46:30] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument/Experiment Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100527 (owner: 10Clare Ming) [20:47:30] (03Merged) 10jenkins-bot: Metrics Platform Instrument/Experiment Configurator: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100528 (owner: 10Clare Ming) [20:47:37] (03Merged) 10jenkins-bot: Metrics Platform 
Instrument/Experiment Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100527 (owner: 10Clare Ming) [20:47:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1038.eqiad.wmnet with OS bookworm [20:49:19] (03PS1) 10Cathal Mooney: lvs2013: correct parent port for private1-b2-codfw vlan2029 int [puppet] - 10https://gerrit.wikimedia.org/r/1100540 (https://phabricator.wikimedia.org/T352784) [20:49:57] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:50:15] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:51:10] (03CR) 10Ssingh: [C:03+1] lvs2013: correct parent port for private1-b2-codfw vlan2029 int [puppet] - 10https://gerrit.wikimedia.org/r/1100540 (https://phabricator.wikimedia.org/T352784) (owner: 10Cathal Mooney) [20:51:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1039.eqiad.wmnet with OS bookworm [20:52:02] (03CR) 10Ryan Kemper: [C:03+1] lvs2013: correct parent port for private1-b2-codfw vlan2029 int [puppet] - 10https://gerrit.wikimedia.org/r/1100540 (https://phabricator.wikimedia.org/T352784) (owner: 10Cathal Mooney) [20:54:27] (03PS3) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [20:54:50] (03CR) 10Cathal Mooney: [C:03+2] lvs2013: correct parent port for private1-b2-codfw vlan2029 int [puppet] - 10https://gerrit.wikimedia.org/r/1100540 (https://phabricator.wikimedia.org/T352784) (owner: 10Cathal Mooney) [20:56:54] (03PS1) 10Bvibber: Enable Chart extension on several pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100544 (https://phabricator.wikimedia.org/T381436) [20:57:11] !log joal@deploy2002 Started deploy 
[analytics/refinery@7ba91e1]: Regular analytics weekly train - HOTFIX 2 [analytics/refinery@7ba91e13] [20:57:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100544 (https://phabricator.wikimedia.org/T381436) (owner: 10Bvibber) [20:59:00] !log joal@deploy2002 Finished deploy [analytics/refinery@7ba91e1]: Regular analytics weekly train - HOTFIX 2 [analytics/refinery@7ba91e13] (duration: 01m 48s) [20:59:06] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [20:59:19] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [20:59:45] !log joal@deploy2002 Started deploy [analytics/refinery@7ba91e1] (thin): Regular analytics weekly train THIN - HOTFIX 2 [analytics/refinery@7ba91e13] [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T2100). [21:00:04] greg-g and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] o/ here :D [21:00:17] !log joal@deploy2002 Finished deploy [analytics/refinery@7ba91e1] (thin): Regular analytics weekly train THIN - HOTFIX 2 [analytics/refinery@7ba91e13] (duration: 00m 31s) [21:00:39] !log joal@deploy2002 Started deploy [analytics/refinery@7ba91e1] (hadoop-test): Regular analytics weekly train TEST - HOTFIX 2 [analytics/refinery@7ba91e13] [21:01:08] !log joal@deploy2002 Finished deploy [analytics/refinery@7ba91e1] (hadoop-test): Regular analytics weekly train TEST - HOTFIX 2 [analytics/refinery@7ba91e13] (duration: 00m 29s) [21:01:25] is a deployer needed? 
[21:02:21] i can do mine myself in a pinch except i'm in a meeting right now :D [21:02:29] so that'd be welcome <3 [21:02:40] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [21:02:40] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal [21:02:48] yes [21:03:05] (03CR) 10Cwhite: [C:03+2] prometheus: restart statsd-exporter on config change [puppet] - 10https://gerrit.wikimedia.org/r/1099822 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [21:03:14] no worries [21:03:18] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:03:18] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: rebooting shortly [21:03:22] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:03:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: rebooting shortly [21:03:32] greg-g: you around? otherwise i'll start with Brooke's patch [21:04:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100544 (https://phabricator.wikimedia.org/T381436) (owner: 10Bvibber) [21:05:02] cjming: sorry!
yes [21:05:04] (03Merged) 10jenkins-bot: Enable Chart extension on several pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100544 (https://phabricator.wikimedia.org/T381436) (owner: 10Bvibber) [21:05:26] sorry for being late, happy to wait my turn :) [21:05:33] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1100544|Enable Chart extension on several pilot wikis (T381436 T381312)]] [21:05:37] T381436: Enable Chart extension on mediawiki.org - https://phabricator.wikimedia.org/T381436 [21:05:38] T381312: Enable Charts extension on Swedish, Italian, Hebrew Wikipedia - https://phabricator.wikimedia.org/T381312 [21:05:46] !log T380555 Moving `wdqs-internal-[main,scholarly]` services into prod by merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094074 [21:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:49] T380555: Enable LVS for wdqs-internal-[main,scholarly] - https://phabricator.wikimedia.org/T380555 [21:05:53] no worries! window should go quick with just config patches in the queue [21:05:55] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: bring graph split into production [puppet] - 10https://gerrit.wikimedia.org/r/1094074 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [21:06:16] (03CR) 10Cwhite: [C:03+2] webperf: set statsv.py --statsd to statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1099720 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [21:08:12] cjming: please ping me when backports are done. 
I missed the train deployment window 🤦‍♀️ [21:08:27] jeena: ack - will do [21:09:02] !log T380555 Rolling out prod change => `ryankemper@cumin2002:~$ sudo cumin -b 8 'A:dnsbox' 'run-puppet-agent'` [21:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:43] 10ops-codfw, 06SRE, 06DC-Ops: Remove defunct lvs cross-dc links in Netbox (lvs2011 & lvs2013) - https://phabricator.wikimedia.org/T381533 (10cmooney) 03NEW p:05Triage→03Low [21:12:17] bvibber: on mwdebug - testable? [21:12:21] lemme test [21:12:48] (03CR) 10Bartosz Dziewoński: "Cherry-picked on the beta cluster following these instructions. I think it works? I was a bit confused for a while, since it didn't seem t" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [21:13:00] !log cjming@deploy2002 cjming, bvibber: Backport for [[gerrit:1100544|Enable Chart extension on several pilot wikis (T381436 T381312)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:06] T381436: Enable Chart extension on mediawiki.org - https://phabricator.wikimedia.org/T381436 [21:13:07] T381312: Enable Charts extension on Swedish, Italian, Hebrew Wikipedia - https://phabricator.wikimedia.org/T381312 [21:13:18] cjming: looks good [21:13:56] (03CR) 10Tchanders: Ensure IP reveal buttons are not shown on Special:MassGlobalBlock (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100150 (https://phabricator.wikimedia.org/T124607) (owner: 10Tchanders) [21:14:12] (03CR) 10Bartosz Dziewoński: "Anyway, this one is more complex than the rest of the stack, and it will affect production and not just the beta cluster, so careful revie" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [21:14:18] !log cjming@deploy2002 cjming, bvibber: Continuing with sync [21:14:37] (03CR) 10Ryan Kemper: [C:03+2] "See here for fix 
patch we needed to ship to bring service.yaml state into alignment with what we'd had in conftool-data in the previous pa" [puppet] - 10https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555) (owner: 10Ryan Kemper) [21:14:46] (03PS3) 10Pcoombe: CSP for banner preview: allow remind me later SMS host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [21:15:56] (03CR) 10Cwhite: [C:03+2] webperf: disable statsd-exporter relaying flag [puppet] - 10https://gerrit.wikimedia.org/r/1099796 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [21:17:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1045.eqiad.wmnet with OS bookworm [21:17:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381503 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm [21:18:10] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-main [21:18:17] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-scholarly [21:19:09] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal: add graph split disc DNS records [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [21:19:40] cjming: ready when you are [21:19:48] (03CR) 10Ryan Kemper: [C:03+2] "Forgot to write in commit message but this was step 9 (the final step) of the lvs add a new service process" [dns] - 10https://gerrit.wikimedia.org/r/1100165 (https://phabricator.wikimedia.org/T379334) (owner: 10Bking) [21:20:17] greg-g: just waiting for bvibber's patch to finish syncing - any minute now [21:20:29] ah, wasn't sure, coolio [21:20:48] (just saw the rebase so thought you were ready ready 
;) ) [21:21:06] !log T379334 Final step (step 9) of spinning up these new services; merged https://gerrit.wikimedia.org/r/c/operations/dns/+/1100165/, next up is the authdns update [21:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:10] T379334: Create DNS records for wdqs-internal-main and wdqs-internal-scholarly - https://phabricator.wikimedia.org/T379334 [21:22:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [21:23:02] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100544|Enable Chart extension on several pilot wikis (T381436 T381312)]] (duration: 17m 29s) [21:23:07] T381436: Enable Chart extension on mediawiki.org - https://phabricator.wikimedia.org/T381436 [21:23:08] T381312: Enable Charts extension on Swedish, Italian, Hebrew Wikipedia - https://phabricator.wikimedia.org/T381312 [21:23:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [21:23:26] bvibber: should be live! [21:23:32] cjming: thx! 
[21:23:38] yw [21:23:53] !log T379334 `ryankemper@dns1004:~$ sudo -i authdns-update` completed [21:24:35] highfive to bvibber for being swat window buddies [21:24:39] :) [21:24:59] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet [21:25:07] greg-g: \o [21:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:20] PROBLEM - Host lvs2013 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:40] RECOVERY - Host lvs2013 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms [21:25:40] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [21:25:47] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093401|CSP for banner preview: allow remind me later SMS host (T380232)]] [21:25:50] T380232: Add app.goacoustic.com to wikipedia.org Content Security Policy (CSP) - https://phabricator.wikimedia.org/T380232 [21:26:06] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache wdqs-internal-main.discovery.wmnet on all recursors [21:26:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs-internal-main.discovery.wmnet on all recursors [21:26:14] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache wdqs-internal-scholarly.discovery.wmnet on all recursors [21:26:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs-internal-scholarly.discovery.wmnet on all recursors [21:26:20] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:26:48] these pybal errors ok? 
[21:27:09] yes please [21:27:23] that host is drained, we are bringing it back up and should go away [21:27:40] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal [21:28:05] sukhe: cool, so OK to proceed with deploys? [21:28:42] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:28:55] greg-g: please do, it's up now (and should not affect it regardless of that) [21:29:00] cool [21:29:03] thanks for checking [21:29:20] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:29:24] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:31:29] greg-g: up on test servers if testable [21:31:58] cjming: testing [21:32:01] !log cjming@deploy2002 cjming, gjg: Backport for [[gerrit:1093401|CSP for banner preview: allow remind me later SMS host (T380232)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:04] T380232: Add app.goacoustic.com to wikipedia.org Content Security Policy (CSP) - https://phabricator.wikimedia.org/T380232 [21:32:20] k8s-mwdebug? 
[21:32:40] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 85 connections established with conf2004.codfw.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal [21:32:43] mwdebug - yes [21:34:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:34:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1086.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:34:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1086.eqiad.wmnet with OS bullseye [21:35:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-be1086.eqiad.wmnet with OS bullseye [21:35:29] hmm, not sure, I'm still getting the CSP policy violation error, but not sure if that's because of how things are setup on mwdebug and csp [21:36:00] do you want to abort or continue? [21:36:54] can you continue and I'll get the security team to review the state? 
the worst case is that we just didn't open it far enough [21:37:04] sure thing [21:37:08] !log cjming@deploy2002 cjming, gjg: Continuing with sync [21:37:09] ie: if anything it just means we're still locked down too much [21:40:18] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:43:27] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093401|CSP for banner preview: allow remind me later SMS host (T380232)]] (duration: 17m 39s) [21:43:30] T380232: Add app.goacoustic.com to wikipedia.org Content Security Policy (CSP) - https://phabricator.wikimedia.org/T380232 [21:43:53] greg-g: should be live :) [21:43:54] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [21:43:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be - jclark@cumin1002" [21:43:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:44:44] cjming: thanks! I'll follow-up with security and fundraising on this. All good for now! 
[21:45:00] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1045 [21:45:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1086.eqiad.wmnet with reason: host reimage [21:46:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1045 [21:46:29] great - closing window then [21:46:32] !log end of UTC late backport window [21:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:39] jeena: all yours [21:46:45] thank you [21:47:42] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100547 (https://phabricator.wikimedia.org/T375665) [21:47:44] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100547 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [21:48:22] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100547 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [21:49:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1086.eqiad.wmnet with reason: host reimage [21:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:32] cjming: just to say it, my test case was old, got a new banner and it worked, all good! 
[21:57:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1044.eqiad.wmnet with OS bookworm [21:57:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm [21:57:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm ex... [21:57:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm ex... [21:59:34] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.6 refs T375665 [21:59:37] T375665: 1.44.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T375665 [22:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241204T2200) [22:03:01] (03PS2) 10Thcipriani: Reinstate the banner for the developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [22:03:01] (03CR) 10Thcipriani: "Got you the privacy link, I'll get the survey link Soon™ Thank you for this ❤️" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [22:05:23] (03Abandoned) 10Thcipriani: Add a banner for the 2024 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100162 (https://phabricator.wikimedia.org/T351109) (owner: 10Thcipriani) [22:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate 
eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:10:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T371742)', diff saved to https://phabricator.wikimedia.org/P71563 and previous config saved to /var/cache/conftool/dbconfig/20241204-221001-ladsgroup.json [22:10:05] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:11:44] (03PS1) 10Eevans: cassandra: configurations merged from upstream 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) [22:12:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:12:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:12:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1086.eqiad.wmnet with OS bullseye [22:12:32] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-be1086.eqiad.wmnet with OS bullseye complete... 
[22:13:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381665 (10Jclark-ctr) [22:13:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1040.eqiad.wmnet with OS bookworm [22:13:46] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1041.eqiad.wmnet with OS bookworm [22:16:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1044.eqiad.wmnet with OS bookworm [22:17:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm [22:18:09] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1085.eqiad.wmnet with OS bullseye [22:18:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-be1085.eqiad.wmnet with OS bullseye [22:25:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P71564 and previous config saved to /var/cache/conftool/dbconfig/20241204-222509-ladsgroup.json [22:26:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [22:26:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381680 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm [22:29:25] !log 
kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1040.eqiad.wmnet with reason: host reimage [22:30:00] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1041.eqiad.wmnet with reason: host reimage [22:32:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1040.eqiad.wmnet with reason: host reimage [22:33:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1044.eqiad.wmnet with reason: host reimage [22:34:19] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1042.eqiad.wmnet with OS bookworm [22:34:44] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1043.eqiad.wmnet with OS bookworm [22:35:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1041.eqiad.wmnet with reason: host reimage [22:37:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1045.eqiad.wmnet with OS bookworm [22:37:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm ex... 
[22:38:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1044.eqiad.wmnet with reason: host reimage [22:40:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P71565 and previous config saved to /var/cache/conftool/dbconfig/20241204-224016-ladsgroup.json [22:45:55] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10381717 (10Dzahn) Cool! I tested the downgrade and upgrade with APT as well on lists2001. Worked both ways. [22:50:26] (03CR) 10Cwhite: [C:03+2] webperf: set statsd exporter timer type to histogram (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1099821 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [22:50:37] (03PS3) 10Cwhite: webperf: set statsd exporter timer type to histogram [puppet] - 10https://gerrit.wikimedia.org/r/1099821 (https://phabricator.wikimedia.org/T355837) [22:50:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1042.eqiad.wmnet with reason: host reimage [22:50:38] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1043.eqiad.wmnet with reason: host reimage [22:51:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1040.eqiad.wmnet with OS bookworm [22:54:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1042.eqiad.wmnet with reason: host reimage [22:54:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1041.eqiad.wmnet with OS bookworm [22:55:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T371742)', diff saved to https://phabricator.wikimedia.org/P71566 and previous config saved to 
/var/cache/conftool/dbconfig/20241204-225523-ladsgroup.json [22:55:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [22:55:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:55:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [22:55:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T371742)', diff saved to https://phabricator.wikimedia.org/P71567 and previous config saved to /var/cache/conftool/dbconfig/20241204-225545-ladsgroup.json [22:56:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:57:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1043.eqiad.wmnet with reason: host reimage [22:58:54] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:04:27] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:34] FIRING: [14x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:06:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538 (10jhathaway) 03NEW [23:06:25] ^dcatap alerts 
are from stale systemd units that need to be cleaned up. the probedown on wdqs1026 i’ll investigate when back near computer [23:06:26] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10381764 (10jhathaway) p:05Triage→03Low [23:08:51] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10381775 (10jhathaway) [23:10:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm [23:10:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381781 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm ex... [23:13:32] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:13:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1042.eqiad.wmnet with OS bookworm [23:16:26] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:16:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1043.eqiad.wmnet with OS bookworm [23:20:54] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [23:21:35] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [23:26:36] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1085.eqiad.wmnet with OS bullseye [23:26:44] 10ops-eqiad, 
06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-be1085.eqiad.wmnet with OS bullseye executed... [23:32:15] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [23:32:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [23:32:50] (03CR) 10Cwhite: [V:03+2 C:03+2] webperf: set statsd exporter timer type to histogram [puppet] - 10https://gerrit.wikimedia.org/r/1099821 (https://phabricator.wikimedia.org/T355837) (owner: 10Cwhite) [23:35:07] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [23:35:36] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [23:39:42] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [23:40:36] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [23:42:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:42:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1044.eqiad.wmnet with OS bookworm [23:42:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm co... 
[23:43:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381848 (10Jclark-ctr) [23:43:40] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1085.eqiad.wmnet with OS bullseye [23:43:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-be1085.eqiad.wmnet with OS bullseye [23:47:04] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [23:47:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm [23:54:36] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1085.eqiad.wmnet with reason: host reimage [23:57:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1085.eqiad.wmnet with reason: host reimage [23:59:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 23:00:00 on 8 hosts with reason: T376150 non-prod hosts [23:59:58] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150